# AfroTech Company

# Business Understanding

AfroTech Company is developing a movie recommendation engine for streaming platforms to enhance user experience through personalized content suggestions. With the rapid expansion of digital movie libraries, users often struggle to discover content aligned with their interests. This leads to decision fatigue, reduced engagement, and increased churn, while valuable content remains underutilized.

To address this, AfroTech’s system will use collaborative filtering techniques—specifically Singular Value Decomposition (SVD) and Least Squares Optimization—to uncover latent patterns in user preferences and movie features. The model will predict unseen ratings and generate relevant recommendations, helping users find content more efficiently.

The primary business objectives are to improve user satisfaction, increase engagement, enhance content discovery, and support long-term user retention through personalized experiences.

Success will be measured by:
- Achieving an RMSE of 0.90 or lower (or MAE <= 0.70) for accurate predictions,
- Reaching at least 95% coverage of users and movies,
- Maintaining a click-through rate of 20% or higher on recommendations,
- Ensuring an average time to first watch under three minutes,
- Achieving 70% user retention over 30 days,
- And maintaining a cosine similarity of 0.85 or higher between latent feature vectors across training runs.

These outcomes will demonstrate the system’s ability to deliver accurate, scalable, and user-focused recommendations that align with AfroTech’s business goals.
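
The accuracy and stability targets above can be checked with a few small helpers. The following is a minimal sketch using only the standard library; the function names and the toy ratings are illustrative assumptions, not part of the AfroTech pipeline.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up ratings, for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"MAE:  {mae(actual, predicted):.3f}")
print(f"Cosine similarity: {cosine_similarity([1.0, 2.0], [2.0, 4.0]):.3f}")
```

RMSE penalizes large misses more heavily than MAE, which is why the two thresholds differ. Comparing latent vectors across training runs likely needs extra care, since factor models can learn the same preferences in a rotated basis.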



# Data Understanding

In [2]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, train_test_split
import warnings
warnings.filterwarnings('ignore')


Explanation of the libraries:
- surprise is the library for recommender systems
- Dataset loads and holds the ratings data
- Reader defines the rating scale (the default is 1 to 5; ratings.csv here ranges from 0.5 to 5)
- cross_validate checks model performance (we have already covered this one, sis)

In [3]:
# #load dataset (movie lens )
# #it's built in the surprise library
# r = Dataset.load_builtin('ml-100k')

# d = pd.DataFrame(r.raw_ratings, columns=['user_id','movie_id','rating','timestamp'])
# d[:3]

In [4]:
#load the datasets
links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')
print(f'Movies:\n{movies[:3]}\n')
print(f'Links:\n{links[:3]}\n')
print(f'Ratings:\n{ratings[:3]}\n')
print(f'Tags:\n{tags[:3]}\n')

Movies:
   movieId                    title  \
0        1         Toy Story (1995)   
1        2           Jumanji (1995)   
2        3  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  

Links:
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0

Ratings:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224

Tags:
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992



The datasets contain the following columns:
- ratings.csv: userId, movieId, rating, timestamp -> this is the core data for matrix factorization.
- movies.csv: movieId, title, genres -> essential for naming recommendations and possible genre analysis.
- tags.csv: userId, movieId, tag, timestamp -> optional for content-based hybrid recommendations (could enrich model later).
- links.csv: movieId, imdbId, tmdbId -> useful if you want to pull external metadata (optional for now) for deployment.

Observation:
- All datasets have a column 'movieId'
- During data cleaning we will convert all column names to lowercase

# Data Cleaning

1. Ratings

In [5]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [6]:
#check data types, shape, duplicates and missing values in ratings
print(f'The datatypes are:\n{ratings.dtypes}\n')
print(f'Ratings has {ratings.shape[0]} rows and {ratings.shape[1]}columns\n')
print(f'Ratings has:\n{ratings.isna().sum()} missing values\n')
print(f'Ratings has:{ratings.duplicated().sum()} duplicates')

# have all column names in lowercase
ratings.columns = ratings.columns.str.lower()
print(ratings.columns)

The datatypes are:
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

Ratings has 100836 rows and 4columns

Ratings has:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64 missing values

Ratings has:0 duplicates
Index(['userid', 'movieid', 'rating', 'timestamp'], dtype='object')


Observations:
- Timestamp will be useful in EDA, but we may have to drop it before modelling
- The shared key movieid must have the same integer dtype in every dataset so that the merges work cleanly
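
Both observations can be checked in a few lines. This is a minimal sketch on tiny stand-in frames (the `*_demo` names are hypothetical; the values are copied from the previews above), assuming the lowercase column names used after cleaning.

```python
import pandas as pd

# Tiny stand-ins for the real frames
ratings_demo = pd.DataFrame({
    "movieid": [1, 3, 6],
    "timestamp": [964982703, 964981247, 964982224],
})
movies_demo = pd.DataFrame({
    "movieid": [1, 3, 6],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"],
})

# The join key should have one and the same dtype in every frame
key_dtypes = {
    "ratings": ratings_demo["movieid"].dtype,
    "movies": movies_demo["movieid"].dtype,
}
same_dtype = len(set(key_dtypes.values())) == 1
print(key_dtypes, "-> consistent:", same_dtype)

# Timestamps are Unix seconds; convert them to datetimes for EDA
ratings_demo["rated_at"] = pd.to_datetime(ratings_demo["timestamp"], unit="s")
print(ratings_demo[["timestamp", "rated_at"]])
```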

2. Movies

In [7]:
#check data types, shape, duplicates and missing values in movies
print(f'The datatypes are:\n{movies.dtypes}\n')
print(f'Movies has {movies.shape[0]} rows and {movies.shape[1]}columns\n')
print(f'Movies has:\n{movies.isna().sum()} missing values\n')
print(f'Movies has:{movies.duplicated().sum()} duplicates')

# have all column names in lowercase
movies.columns = movies.columns.str.lower()
print(movies.columns)

The datatypes are:
movieId     int64
title      object
genres     object
dtype: object

Movies has 9742 rows and 3columns

Movies has:
movieId    0
title      0
genres     0
dtype: int64 missing values

Movies has:0 duplicates
Index(['movieid', 'title', 'genres'], dtype='object')


3. Links

In [8]:
#check data types, shape, duplicates and missing values in links
print(f'The datatypes are:\n{links.dtypes}\n')
print(f'links has {links.shape[0]} rows and {links.shape[1]}columns\n')
print(f'links has:\n{links.isna().sum()} missing values\n')
print(f'links has:{links.duplicated().sum()} duplicates')

# have all column names in lowercase
links.columns = links.columns.str.lower()
print(links.columns)

The datatypes are:
movieId      int64
imdbId       int64
tmdbId     float64
dtype: object

links has 9742 rows and 3columns

links has:
movieId    0
imdbId     0
tmdbId     8
dtype: int64 missing values

links has:0 duplicates
Index(['movieid', 'imdbid', 'tmdbid'], dtype='object')


In [9]:
# drop the missing values in tmdbid
links = links.dropna()
links.isna().sum()
print(links.shape)

(9734, 3)


4. tags

In [10]:
#check data types, shape, duplicates and missing values in tags
print(f'The datatypes are:\n{tags.dtypes}\n')
print(f'tags has {tags.shape[0]} rows and {tags.shape[1]}columns\n')
print(f'tags has:\n{tags.isna().sum()} missing values\n')
print(f'tags has:{tags.duplicated().sum()} duplicates')

# have all column names in lowercase
tags.columns = tags.columns.str.lower()

The datatypes are:
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

tags has 3683 rows and 4columns

tags has:
userId       0
movieId      0
tag          0
timestamp    0
dtype: int64 missing values

tags has:0 duplicates


In [27]:
tag_characters = tags['tag'].value_counts()
tag_characters

tag
In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
funny                 23
                    ... 
small towns            1
In Your Eyes           1
Lloyd Dobbler          1
weak plot              1
Heroic Bloodshed       1
Name: count, Length: 1589, dtype: int64

Observation:
- The tags dataset has the fewest rows of the four datasets (3,683)
- Upside:
  - It could support content-based features
    - e.g. recommending 'thrillers' to a user who rated a thriller highly
  - It makes it easy to explain why we recommend a certain movie
    - e.g. "recommended because you liked horror movies"
  - It could make recommendations more diverse by combining latent factors with tag-based similarities
  - We could cluster the tags to group movies by theme (using cosine similarity)
- Downside:
  - The dataset is small compared to ratings, which has more than 100k entries
  - Since tags are user-generated, some entries carry little meaning for us
    - e.g. a tag like 'beautiful' tells us little about the movie itself
- Because the tags lack quality and volume, we will set them aside (not merge them)
- Movie genres will have to stand in for tags
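
Should tags be revisited later, the tag-based similarity idea mentioned above could look roughly like this. It is a minimal sketch only: the movie keys and tag lists are hypothetical, and `tag_cosine` is an illustrative helper rather than anything used in this notebook.

```python
import math
from collections import Counter

# Hypothetical tag lists per movie (tag strings borrowed from the previews above)
movie_tags = {
    "movie_a": ["funny", "will ferrell", "Highly quotable"],
    "movie_b": ["funny", "atmospheric"],
    "movie_c": ["atmospheric", "thought-provoking"],
}

def tag_cosine(tags_x, tags_y):
    """Cosine similarity between two bags of (case-insensitive) tags."""
    cx = Counter(t.lower() for t in tags_x)
    cy = Counter(t.lower() for t in tags_y)
    dot = sum(cx[t] * cy[t] for t in cx.keys() & cy.keys())
    norm = (math.sqrt(sum(c * c for c in cx.values()))
            * math.sqrt(sum(c * c for c in cy.values())))
    return dot / norm if norm else 0.0

sim_ab = tag_cosine(movie_tags["movie_a"], movie_tags["movie_b"])
sim_ac = tag_cosine(movie_tags["movie_a"], movie_tags["movie_c"])
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Movies that share tags score above zero and movies with no tags in common score exactly zero. The same function would work on the pipe-separated genres once they are split into lists.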

### Merging the datasets

In [29]:
#The most important dataset is ratings
rate_movie = pd.merge(ratings,movies, on='movieid',how='left')
rate_movie[:3]

Unnamed: 0,userid,movieid,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller


In [30]:
#merge rate_movie and links
df = pd.merge(rate_movie,links,how='left',on='movieid')
df[:3]

Unnamed: 0,userid,movieid,rating,timestamp,title,genres,imdbid,tmdbid
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709.0,862.0
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,113228.0,15602.0
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,113277.0,949.0


In [31]:
# Sanity check
print("Ratings:",ratings.shape)
print("Movies:",movies.shape)
print("Links:",links.shape)
print("Merged Dataset:",df.shape)

Ratings: (100836, 4)
Movies: (9742, 3)
Links: (9734, 3)
Merged Dataset: (100836, 8)


In [None]:
# check for missing values & drop them
df.isna().sum()
df = df.dropna()  # dropna returns a new DataFrame, so assign it back
df.isna().sum()
 # Dropped the 13 rows with missing imdbid and tmdbid

userid       0
movieid      0
rating       0
timestamp    0
title        0
genres       0
imdbid       0
tmdbid       0
dtype: int64
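
Because rows were dropped from links before the left merge, ratings for those movies come back with missing imdbid and tmdbid, which would explain the rows removed above. A minimal sketch of how such rows can be surfaced, using made-up stand-in frames (`ratings_demo` and `links_demo` are hypothetical names):

```python
import pandas as pd

ratings_demo = pd.DataFrame({
    "userid": [1, 1, 2],
    "movieid": [1, 3, 99],   # movie 99 has no entry in links_demo
    "rating": [4.0, 4.0, 3.5],
})
links_demo = pd.DataFrame({
    "movieid": [1, 3],
    "imdbid": [114709, 113228],
    "tmdbid": [862.0, 15602.0],
})

# validate= raises if movieid is not unique in links; indicator= adds a
# _merge column saying whether each row found a match on the right side
merged = pd.merge(ratings_demo, links_demo, on="movieid", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)

# dropna() returns a new frame, so the result has to be assigned
cleaned = merged.drop(columns="_merge").dropna()
print(merged.shape, "->", cleaned.shape)
```

The unmatched rows also explain why imdbid shows up as float64 after the merge: an integer column that gains missing values is upcast to float.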

In [None]:
#check for duplicates
print(f"The dataset has:{df.duplicated().sum()} duplicates")

The dataset has:0 duplicates


In [40]:
df.dtypes

userid         int64
movieid        int64
rating       float64
timestamp      int64
title         object
genres        object
imdbid       float64
tmdbid       float64
dtype: object

In [None]:
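# save a copy of the merged dataset
# NOTE: the file name is a placeholder; the original cell did not specify one
df.to_csv('merged_dataset.csv', index=False)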

# Exploratory Data Analysis

In [6]:

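# NOTE: `data` is never defined above; the outputs below look like Surprise's
# built-in ml-100k dataset, so loading it here is an assumption
data = Dataset.load_builtin('ml-100k')

#instantiate svd
algo = SVD()

#evaluate with 5-fold cross validation
cross_validate(algo,data,measures=['RMSE','MSE'],cv=5,verbose=True)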

Evaluating RMSE, MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9412  0.9374  0.9370  0.9322  0.9321  0.9360  0.0035  
MSE (testset)     0.8859  0.8787  0.8779  0.8689  0.8688  0.8760  0.0065  
Fit time          0.40    0.41    0.41    0.41    0.41    0.41    0.00    
Test time         0.05    0.05    0.05    0.05    0.09    0.06    0.02    


{'test_rmse': array([0.94123365, 0.93737717, 0.93695163, 0.93215321, 0.93207609]),
 'test_mse': array([0.88592078, 0.87867596, 0.87787835, 0.8689096 , 0.86876583]),
 'fit_time': (0.4030900001525879,
  0.41318297386169434,
  0.41394591331481934,
  0.41162705421447754,
  0.41321492195129395),
 'test_time': (0.04991793632507324,
  0.046463966369628906,
  0.04782509803771973,
  0.04556012153625488,
  0.08956193923950195)}

Preach babygirl:
- The default parameters for SVD are:
  1. n_factors=100: number of latent features (hidden patterns)
  2. n_epochs=20: number of passes over the training data
  3. lr_all=0.005: learning rate
  4. reg_all=0.02: regularization strength (controls overfitting)

- We split the dataset into 5 folds: train on 4 and test on 1, then repeat until each fold has been the test set once
- The mean RMSE is about 0.94, so a typical prediction is a little under one star away from the true rating; this is still above the 0.90 target set in Business Understanding
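
To make the four parameters above concrete, here is a minimal pure-Python sketch of biased matrix factorization trained by stochastic gradient descent. It follows the general form of the update described in the Surprise documentation but is not Surprise's implementation; `train_mf`, `predict_rating`, `train_rmse`, and the toy ratings are all illustrative assumptions.

```python
import math
import random

def predict_rating(u, i, mu, bu, bi, p, q):
    """Global mean + user bias + item bias + dot product of latent factors."""
    return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

def train_mf(ratings, n_factors=2, n_epochs=20, lr_all=0.005, reg_all=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):                      # n_epochs passes over the data
        for u, i, r in ratings:
            err = r - predict_rating(u, i, mu, bu, bi, p, q)
            # lr_all scales each step, reg_all shrinks parameters toward zero
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return mu, bu, bi, p, q

def train_rmse(ratings, model):
    """RMSE of the model on the ratings it was trained on."""
    errors = [(r - predict_rating(u, i, *model)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(errors) / len(errors))

# Made-up (user, item, rating) triples
toy = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0), (2, 30, 1.0),
       (3, 20, 2.0), (3, 30, 1.0), (3, 10, 5.0)]

before = train_rmse(toy, train_mf(toy, n_epochs=0))
after = train_rmse(toy, train_mf(toy, n_epochs=200))
print(f"training RMSE before: {before:.3f}, after: {after:.3f}")
```

The sketch reports training error only, which is why cross-validation on held-out folds is still needed to judge real performance.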

In [7]:
#train and predict the dataset
#train
trainset = data.build_full_trainset()
algo.fit(trainset)

#predict rating for user 196 on item 302
pred = algo.predict(uid=196,iid=302)
print(pred)

user: 196        item: 302        r_ui = None   est = 3.53   {'was_impossible': False}


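Preach, sis:
- We build the trainset using all available ratings
- fit() learns the latent factors for every user and item
- uid: user id
- iid: item id
- r_ui = None: no true rating was passed to predict(), so there is nothing to compare against
- est = 3.53: the estimated rating
- 'was_impossible': False shows that a prediction could be made
- Caveat: Surprise matches users and items by raw id. If the dataset stores raw ids as strings (as its built-in datasets do), the integers 196 and 302 are treated as an unknown user and item, and est falls back to the global mean rating. Passing uid='196' and iid='302' would give a personalized estimate.

Observation:
- We predict that user 196 will rate movie 302 at 3.53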
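
How an estimate like this is assembled, and why the type of the id matters, can be sketched as follows. The code assumes the documented form of Surprise's SVD estimate (global mean plus whatever bias and factor terms are known, clipped to the rating scale); the `estimate` function and all parameter values are hypothetical.

```python
def estimate(uid, iid, mu, bu, bi, p, q, rating_scale=(1.0, 5.0)):
    """Add only the terms that are known for this user and item."""
    est = mu
    if uid in bu:
        est += bu[uid]
    if iid in bi:
        est += bi[iid]
    if uid in p and iid in q:
        est += sum(a * b for a, b in zip(p[uid], q[iid]))
    low, high = rating_scale
    return min(high, max(low, est))   # keep the estimate on the rating scale

# Hypothetical learned parameters, keyed by string raw ids
mu = 3.5
bu = {"196": 0.2}
bi = {"302": 0.4}
p = {"196": [0.5, -0.1]}
q = {"302": [0.4, 0.2]}

known = estimate("196", "302", mu, bu, bi, p, q)   # ids match the stored keys
unknown = estimate(196, 302, mu, bu, bi, p, q)     # ints do not match the strings
print(f"string ids: {known:.2f}, integer ids: {unknown:.2f}")
```

With integer ids nothing matches, so the estimate collapses to the global mean. A prediction that sits very close to the dataset's average rating is therefore worth double-checking.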

## Making recommendations

In [10]:
#getting all movie ids
all_items = trainset.all_items()
all_items_ids = [trainset.to_raw_iid(iid) for iid in all_items]

#recommend top 5 movies to user 196
#NOTE: pass the id in the same type as the dataset's raw ids (e.g. '196' if they are strings)
user_id = 196
predictions = [algo.predict(user_id,iid) for iid in all_items_ids]

#sort by estimated ratings
top_5 = sorted(predictions, key=lambda x: x.est, reverse=True)[:5]
[(pred.iid, round(pred.est, 2)) for pred in top_5]

[('169', 4.59), ('318', 4.58), ('408', 4.58), ('64', 4.53), ('483', 4.53)]
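
The cell above scores every item, including movies the user has already rated. A minimal sketch of a top-N helper that skips seen items is shown below; `top_n`, the score table, and the ids are hypothetical, and `predict_fn` stands in for something like `lambda u, i: algo.predict(u, i).est`.

```python
import heapq

def top_n(user_id, candidate_items, seen_items, predict_fn, n=5):
    """Return the n highest-scoring items the user has not rated yet."""
    unseen = [iid for iid in candidate_items if iid not in seen_items]
    scored = ((predict_fn(user_id, iid), iid) for iid in unseen)
    best = heapq.nlargest(n, scored)              # avoids sorting the full list
    return [(iid, round(score, 2)) for score, iid in best]

# Made-up scores for illustration
toy_scores = {"a": 4.6, "b": 4.5, "c": 4.4, "d": 3.0}
already_rated = {"b"}

recommendations = top_n("196", toy_scores.keys(), already_rated,
                        lambda user, item: toy_scores[item], n=2)
print(recommendations)
```

In Surprise itself, `trainset.build_anti_testset()` offers a related route by listing the user-item pairs that are absent from the training data.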

# Executive Summary

This project presents a personalized movie recommendation system developed for AfroTech Company, a streaming technology provider aiming to improve user engagement and content discovery. The system is designed to address the challenges users face in finding relevant content within large movie libraries.

Using collaborative filtering methods—specifically Singular Value Decomposition (SVD) combined with Least Squares Optimization—the system analyzes historical user ratings to uncover latent preferences and predict unseen ratings. This approach enables AfroTech to deliver data-driven, tailored movie recommendations that align with individual user tastes.

The project follows a structured pipeline including data preparation, matrix construction, model training, and evaluation. Success is measured through key performance indicators such as prediction accuracy (RMSE and MAE), recommendation coverage, and simulated engagement metrics like click-through rate and retention.

By the end of this project, AfroTech will have a functional and scalable recommendation engine prototype that supports its goal of enhancing the user experience through intelligent, personalized recommendations.
