## Netflix Movie Recommendation
### This dataset consists of 9,999 records with 9 columns.

Driven by excitement and a commitment to improving the streaming experience for viewers, I examined Netflix data and built a movie recommendation system. I became interested in data analysis and recommendation systems for the following reasons:

- User-Centric Focus: My main driving force was to improve and personalize the Netflix customer experience. Customized movie recommendations can save viewers time when looking for something to watch while also helping them discover content they enjoy.

- Effect on Decision-Making: Data analysis enables companies to make well-informed decisions. By looking at the tastes and behaviors of viewers, I can deliver insights that inform content creation, licensing, and the platform's general strategy.

- Passion for Movies: I'm a die-hard movie fan who loves the magic of cinematic storytelling. My motivation stems from the chance to present audiences with both underappreciated films and recent releases that suit their preferences.

- Data-Driven Decision-Making: Data is powerful, and making decisions based on it is the cornerstone of my approach. By examining user data, I can help Netflix understand what viewers want, when they want it, and why they want it.

In conclusion, I built a movie recommendation system and examined Netflix data because I'm committed to improving the streaming experience, sharpening content recommendations, and supporting the entertainment sector. My love of movies and my analytical skills enable me to offer useful insights and support Netflix's goal of providing outstanding entertainment to millions of users worldwide.


In [1]:
#Import important libraries to perform EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## DATA CLEANING

In [2]:
#Import dataset which is in CSV format
netflix_df = pd.read_csv('movies.csv')
netflix_df.head(5)

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
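
The YEAR values in the preview above are strings such as `(2021)` or `(2010–2022)` rather than plain numbers. If a numeric release year is needed later, one possible approach is to pull out the first four-digit number. This is only a sketch on made-up sample values; the `start_year` name is hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the YEAR strings shown in the preview
years = pd.Series(["(2021)", "(2021– )", "(2010–2022)", None], name="YEAR")

# Take the first run of four digits as the start year; missing rows stay NaN
start_year = pd.to_numeric(years.str.extract(r"(\d{4})", expand=False), errors="coerce")
print(start_year.tolist())
```

For a series, this keeps only the year the run began; the end year would need a second capture group.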


### Features Explanation

-  MOVIES: Title of the movie or TV show
-  YEAR: Release year, or the span of years for a series
-  GENRE: Genre(s) of the movie or TV show
-  RATING: Average viewer rating (out of 10)
-  ONE-LINE: A one-line synopsis of the title
-  STARS: The director(s) and main cast
-  VOTES: Number of viewer votes behind the rating
-  RunTime: The duration of the title
-  Gross: Gross revenue of the title, in millions of US dollars

In [3]:
# Get column information to see whether there are nulls
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MOVIES    9999 non-null   object 
 1   YEAR      9355 non-null   object 
 2   GENRE     9919 non-null   object 
 3   RATING    8179 non-null   float64
 4   ONE-LINE  9999 non-null   object 
 5   STARS     9999 non-null   object 
 6   VOTES     8179 non-null   object 
 7   RunTime   7041 non-null   float64
 8   Gross     460 non-null    object 
dtypes: float64(2), object(7)
memory usage: 703.2+ KB
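
The `info()` output lists VOTES as `object` rather than a numeric type. A likely cause, though the raw strings are not shown here, is thousands separators such as `21,062`. Under that assumption, a minimal sketch of the conversion on made-up sample values could look like this (the `VOTES_num` column name is hypothetical):

```python
import pandas as pd

# Hypothetical sample: vote counts stored as text with thousands separators
sample = pd.DataFrame({"VOTES": ["21,062", "885,805", None, "377"]})

# Remove the commas, then coerce to numbers; unparseable entries become NaN
sample["VOTES_num"] = pd.to_numeric(
    sample["VOTES"].str.replace(",", "", regex=False), errors="coerce"
)
print(sample.dtypes)
```

With a numeric column, `describe()` would then include VOTES alongside RATING and RunTime.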


In [4]:
#checking for total missing values by number and percentage
print("Missing Values:\n")
for col in netflix_df.columns:
    missing = netflix_df[col].isna().sum()
    percent = missing / netflix_df.shape[0] * 100
    print("%s: %.2f%% (%d)" % (col,percent,missing))

Missing Values:

MOVIES: 0.00% (0)
YEAR: 6.44% (644)
GENRE: 0.80% (80)
RATING: 18.20% (1820)
ONE-LINE: 0.00% (0)
STARS: 0.00% (0)
VOTES: 18.20% (1820)
RunTime: 29.58% (2958)
Gross: 95.40% (9539)
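
The loop above can also be written without iterating over columns. The sketch below builds the same count and percentage summary as a small table, using a made-up `demo` frame in place of `netflix_df`; the `missing_summary` name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for netflix_df
demo = pd.DataFrame({
    "MOVIES": ["a", "b", "c", "d"],
    "RATING": [6.1, np.nan, 8.2, np.nan],
})

# isna().sum() counts the nulls; isna().mean() gives the fraction of nulls
missing_summary = pd.DataFrame({
    "missing": demo.isna().sum(),
    "percent": (demo.isna().mean() * 100).round(2),
}).sort_values("percent", ascending=False)
print(missing_summary)
```

Sorting by percentage puts the sparsest columns (Gross, in this dataset) at the top.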


### Removing unwanted characters
In the GENRE, ONE-LINE, and STARS columns, the newline character "\n" appears throughout without adding any meaning to the records, so it needs to be removed.

In [5]:
# Removing "\n" from GENRE, ONE-LINE, and STARS columns
for col in ['GENRE','ONE-LINE','STARS']:
    netflix_df[col] = netflix_df[col].str.replace("\n","").str.strip()

netflix_df.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth| Stars:Peri Baume...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,
3,Rick and Morty,(2013– ),"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,
4,Army of Thieves,(2021),"Action, Crime, Horror",,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer| Stars:Matt...,,,


In [6]:
# Split the STARS column into separate Director and Stars columns
def extract_director(direc):
    if 'Director' in direc or 'Directors' in direc:
        director = direc.strip().split("|")[0] # the part before "|" holds the director(s)
        return director.split(":")[1] # Return the Director name
    else:
        return ''

def extract_stars(stars):
    # Accept both the singular "Star:" and the plural "Stars:" label
    if 'Star:' not in stars and 'Stars:' not in stars:
        return ''
    else:
        return stars.split(":")[-1] # last value in this list will be the stars

netflix_df['Director'] = netflix_df['STARS'].apply(extract_director)
netflix_df['Stars'] = netflix_df['STARS'].apply(extract_stars)

# View head of these columns
netflix_df[['STARS','Director','Stars']].head()

Unnamed: 0,STARS,Director,Stars
0,Director:Peter Thorwarth| Stars:Peri Baume...,Peter Thorwarth,"Peri Baumeister, Carl Anton Koch, Alexander Sc..."
1,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",,"Chris Wood, Sarah Michelle Gellar, Lena Headey..."
2,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",,"Andrew Lincoln, Norman Reedus, Melissa McBride..."
3,"Stars:Justin Roiland, Chris Parnell, Spencer G...",,"Justin Roiland, Chris Parnell, Spencer Grammer..."
4,Director:Matthias Schweighöfer| Stars:Matt...,Matthias Schweighöfer,"Matthias Schweighöfer, Nathalie Emmanuel, Ruby..."


In [7]:
#checking for total missing values by number and percentage again to view director and stars
print("Missing Values:\n")
for col in netflix_df.columns:
    missing = netflix_df[col].isna().sum()
    percent = missing / netflix_df.shape[0] * 100
    print("%s: %.2f%% (%d)" % (col,percent,missing))

Missing Values:

MOVIES: 0.00% (0)
YEAR: 6.44% (644)
GENRE: 0.80% (80)
RATING: 18.20% (1820)
ONE-LINE: 0.00% (0)
STARS: 0.00% (0)
VOTES: 18.20% (1820)
RunTime: 29.58% (2958)
Gross: 95.40% (9539)
Director: 0.00% (0)
Stars: 0.00% (0)
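
Director and Stars show 0.00% missing here because `extract_director` and `extract_stars` return an empty string, not NaN, when nothing is found, and `isna()` does not count empty strings. To see how many rows really lack a director, one option is to convert the empty strings to NaN first. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the derived Director column
demo = pd.DataFrame({"Director": ["Peter Thorwarth", "", "", "Matthias Schweighöfer"]})

before = demo["Director"].isna().sum()   # empty strings are not counted as missing

# Treat empty strings as missing values
demo["Director"] = demo["Director"].replace("", np.nan)
after = demo["Director"].isna().sum()

print(before, after)
```

If this were applied to `netflix_df`, the later `fillna("")` step in the recommendation section would turn the values back into empty strings, so the feature text would be unaffected.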


In [8]:
netflix_df.describe()

Unnamed: 0,RATING,RunTime
count,8179.0,7041.0
mean,6.921176,68.688539
std,1.220232,47.258056
min,1.1,1.0
25%,6.2,36.0
50%,7.1,60.0
75%,7.8,95.0
max,9.9,853.0


### Univariate analysis of some features

##### Genre

In [9]:
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Count each genre combination; rename_axis + reset_index(name=...) gives the
# same 'Genre' and 'Count' column names across pandas versions
netflix_df_genre = netflix_df['GENRE'].value_counts().rename_axis('Genre').reset_index(name='Count')

fig = px.bar(data_frame = netflix_df_genre.sort_values(by='Count',ascending = False).head(10),
             x = 'Genre', y = 'Count')
colors = ['darkred'] * 10
fig.update_traces(marker_color = colors)
fig.update_layout(title = 'Top 10 Genre Combinations')

fig.show()

In [10]:
# Count how often each individual genre appears
from collections import Counter

genre_raw = netflix_df['GENRE'].dropna().to_list()
genre_list = list()

for genres in genre_raw:
    genres = genres.split(", ")
    for g in genres:
        genre_list.append(g)
        
genre_df = pd.DataFrame.from_dict(Counter(genre_list), orient = 'index').rename(columns = {0:'Count'})
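
The counting loop above can also be expressed with pandas string methods. The sketch below is an alternative, not the notebook's method: it splits each combination, uses `explode()` to give every genre its own row, and then counts. The `demo` series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of GENRE values, including one missing entry
demo = pd.Series(["Action, Horror, Thriller", "Drama, Horror", None], name="GENRE")

genre_counts = (
    demo.dropna()
        .str.split(", ")   # "Action, Horror" -> ["Action", "Horror"]
        .explode()         # one row per individual genre
        .str.strip()       # guard against stray whitespace
        .value_counts()
)
print(genre_counts)
```

The result is a Series indexed by genre, which could be passed to the pie chart in place of `genre_df` after a `to_frame('Count')` call.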

In [11]:
# Genre Count Distribution
fig = px.pie(data_frame = genre_df,
             values = 'Count',
             names = genre_df.index,
             color_discrete_sequence = px.colors.qualitative.Safe)

fig.update_traces(textposition = 'inside',
                  textinfo = 'label+percent',
                  pull = [0.05] * len(genre_df.index.to_list()))

fig.update_layout(title = {'text':'Genre Distribution'},
                  legend_title = 'Genre',
                  uniformtext_minsize=13,
                  uniformtext_mode='hide',
                  font = dict(
                      family = "Courier New, monospace",
                      size = 18,
                      color = 'black'
                  ))


fig.show()

##### Rating

In [12]:
# Ten most frequent rating values; explicit column names work across pandas versions
rating_counts = netflix_df['RATING'].value_counts().rename_axis('Rating').reset_index(name='Count').head(10)

fig = px.bar(data_frame = rating_counts,
             x = 'Rating', y = 'Count',
             title = 'Top 10 Most Common Ratings')

fig.update_yaxes(title = 'Count')

fig.update_xaxes(type ='category',
                 title = 'Rating (out of 10)')
colors = ['darkred'] * 10
fig.update_traces(marker_color = colors)

fig.show()

###### Best and Worst Rated Titles

In [13]:
#Best rated titles

netflix_df.sort_values(["RATING", "MOVIES"], ascending=False).head(2)

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Director,Stars
7640,BoJack Horseman,(2014–2020),"Animation, Comedy, Drama",9.9,BoJack reconnects with faces from his past.,"Director:Amy Winfrey| Stars:Will Arnett, A...",12369,26.0,,Amy Winfrey,"Will Arnett, Amy Sedaris, Alison Brie, Paul F...."
8510,Avatar: The Last Airbender,(2005–2008),"Animation, Action, Adventure",9.9,Aang's moment of truth arrives. Can he defeat ...,Director:Joaquim Dos Santos| Stars:Zach Ty...,8813,92.0,,Joaquim Dos Santos,"Zach Tyler, Mae Whitman, Jack De Sena, Michael..."


In [14]:
#Worst rated titles

netflix_df.sort_values(["RATING", "MOVIES"], ascending=True).head(2)

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Director,Stars
1166,Raketsonyeondan,(2021– ),"Comedy, Drama, Sport",1.1,A city kid is brought to the countryside by hi...,"Stars:Kim Sang-kyung, Na-ra Oh, Tang Joon-sang...",25629,80.0,,,"Kim Sang-kyung, Na-ra Oh, Tang Joon-sang, Sang..."
5365,Defcon 2012,(2010),Sci-Fi,1.8,"On October 30, 2009 an independent filmmaker a...",Director:R. Christian Anderson| Stars:Shy ...,377,92.0,,R. Christian Anderson,"Shy Pilgreen, Dan Gruenberg, Brian Neil Hoff, ..."


##### Gross

In [15]:
# New DataFrame with no NaN in the Gross column; .copy() avoids SettingWithCopyWarning
gross_df = netflix_df[~netflix_df['Gross'].isna()].copy()

# Extract the numerical value
def extract_gross(gross):
    return float(gross.replace("$","").replace("M",""))

# Unit is million US dollars
gross_df['Gross'] = gross_df['Gross'].apply(extract_gross)

# Highest-grossing title
print("Highest Gross movie:",gross_df.iloc[gross_df['Gross'].argmax()]['MOVIES'])

Highest Gross movie: Beauty and the Beast


In [25]:
fig = px.bar(data_frame = gross_df.sort_values(by='Gross', ascending = False).head(10),
             x = 'MOVIES', y = 'Gross',
             title = 'Top 10 Highest-Grossing Titles')
fig.update_layout(yaxis_title = 'Gross (million US$)')
colors = ['darkred'] * 10
fig.update_traces(marker_color = colors)
fig.show()

### Movie Recommendation

In [17]:
# Features used: GENRE, ONE-LINE, Director, Stars (RATING and RunTime were considered but are not included)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

features = ['GENRE','ONE-LINE','Director','Stars']

# Filling in missing values with Blank String
for feature in features:
    netflix_df[feature] = netflix_df[feature].fillna("")

netflix_df['combined_features'] = netflix_df['GENRE'] + " " + netflix_df['ONE-LINE'] + " " + netflix_df['Director'] + " " + netflix_df['Stars'] 
cv = CountVectorizer()
count_matrix = cv.fit_transform(netflix_df['combined_features'])
cosine_sim = cosine_similarity(count_matrix)
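
To make the step above concrete: `CountVectorizer` turns each combined text into a vector of word counts, and cosine similarity scores two vectors A and B as A·B / (|A| |B|), so titles that share more words score closer to 1. The toy corpus below is made up for illustration and is not drawn from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined-feature strings for three titles
docs = [
    "Action Horror zombies survivors",
    "Drama Horror zombies sheriff",
    "Comedy romance wedding",
]

toy_matrix = CountVectorizer().fit_transform(docs)  # rows = titles, columns = words
toy_sim = cosine_similarity(toy_matrix)             # 3 x 3 matrix of pairwise scores

# The first two titles share 2 of their 4 words; the third shares none
print(toy_sim.round(2))
```

Because raw counts weight every word equally, common words in the synopses can inflate scores; a TF-IDF weighting is one alternative worth considering, though the notebook does not use it.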

In [18]:
# Function for movie recommendation
def movie_recommendation(mov, sim_num = 5):
    """Print the sim_num titles most similar to the first title matching mov."""

    user_choice = mov

    try:
        # First title containing the search text (case-insensitive, literal match)
        ref_index = netflix_df[netflix_df['MOVIES'].str.contains(user_choice, case = False, regex = False)].index[0]

        similar_movies = list(enumerate(cosine_sim[ref_index]))

        # Sort by similarity, highest first; [1:] skips the top entry, which is
        # expected to be the reference title itself
        sorted_similar_movies = sorted(similar_movies, key = lambda x: x[1], reverse = True)[1:]

        header = 'Recommended Movies for [{}]'.format(user_choice)
        print('\n' + header)
        print('-' * len(header))

        # Slice so that exactly sim_num recommendations are printed
        for similar_movie_id, s_score in sorted_similar_movies[:sim_num]:
            similar_movie_title = netflix_df['MOVIES'].iloc[similar_movie_id]
            print('{:40} -> {:.3f}'.format(similar_movie_title, s_score))
    except IndexError:
        print("\n[{}] is not in our database!".format(user_choice))
        print("We couldn't recommend anything...Sorry...")

In [19]:
# Search for movies whose title contains the keyword
def movie_available(key):
    """List every title containing key (case-sensitive, literal match)."""

    keyword = key

    print("Movie with keyword: [{}]".format(keyword))

    matches = netflix_df[netflix_df['MOVIES'].str.contains(keyword, regex = False)]['MOVIES'].to_list()
    for i, mov in enumerate(matches):
        print("{}) {} ".format(i+1,mov))

In [20]:
# Running the Function
movie_available("Beauty")

Movie with keyword: [Beauty]
1) Beauty and the Beast 
2) Beauty and the Beast 
3) Beauty 
4) Chasing Beauty 


In [21]:
# Running the Function 
movie_recommendation("The Walking Dead")


Recommended Movies for [The Walking Dead]
-----------------------------------------
Gerald's Game                            -> 0.367
Hija única                               -> 0.358
Vampire in the Garden                    -> 0.349
Exception                                -> 0.345
The Queen                                -> 0.345


In [23]:
# Running the Function with argument
movie_recommendation("The Walking Dead",7)


Recommended Movies for [The Walking Dead]
-----------------------------------------
Gerald's Game                            -> 0.367
Hija única                               -> 0.358
Vampire in the Garden                    -> 0.349
Exception                                -> 0.345
The Queen                                -> 0.345
Prison Break                             -> 0.343
Sonora                                   -> 0.342
