# Movie Recommendation System

## Objective:
The goal of this project is to develop a movie recommendation system based on user data and preferences. By utilizing a variety of characteristics, the system will be able to suggest movies that are tailored to individual users.

### Key Features:
- **Personalized Recommendations**: Tailor movie suggestions based on the following user characteristics:
    - Age
    - Movie genre preferences
    - Viewing history
    - Ratings given to previous movies
    - Popularity trends among users with similar tastes

### Approach:
This system leverages data analysis and machine learning techniques to:
- Understand patterns in user behavior and preferences
- Identify relationships between movie characteristics and user tastes
- Provide dynamic recommendations that adapt as user preferences evolve

### Technology Stack:
- **Python**: Core programming language for data handling and logic implementation
- **Pandas & NumPy**: For data manipulation and analysis
- **Scikit-learn**: Machine learning library for model building and evaluation
- **Flask**: Web framework to deploy the recommendation system as an interactive application


### Databases / DataSources
- IMDB dataset's

In [4]:
# packages
import pandas as pd
import os
import numpy as np

In [23]:
# Class of Extraction
# create a function to read a tsv file

class Extract:
    def __init__(self, dir_path=None):
        # Set the directory path to 'data' folder
        if dir_path is None:
            # Use the 'data' folder inside the current directory
            self.dir_path = os.path.join(os.getcwd(), '..', 'data')
        else:
            self.dir_path = dir_path

    def read_tsv(self, filename: str):
        '''Function to read all TSV files. This function uses Pandas to extract the TSV data.'''
        # Construct full file path
        file_path = os.path.join(self.dir_path, filename)
        
        # Read the TSV file with pandas
        df = pd.read_csv(file_path, delimiter='\t', low_memory=False)
        
        return df
    
# Instance Extract class
extract_inst = Extract()

## title.akas.tsv.gz
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title.One or more of the following: "alternative","dvd",  "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title


In [12]:
# Create a Dataframe
title_akas = extract_inst.read_tsv('title_akas.tsv')

In [20]:
# With that we came a conclusion that are not nullable data inside this dataset
title_akas.describe()

Unnamed: 0,ordering,isOriginalTitle
count,49929820.0,49929820.0
mean,4.289335,0.2223854
std,3.982185,0.4158487
min,1.0,0.0
25%,2.0,0.0
50%,4.0,0.0
75%,6.0,0.0
max,251.0,1.0


In [17]:
# Let's do some dig'in 
# find some movies with the word 'cars' in it

title_akas.loc[title_akas['title'].str.contains('cars') == True].head(5)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
22755,tt0007305,1,Scars and Stripes Forever,\N,\N,original,\N,1
22756,tt0007305,2,Scars and Stripes Forever,US,\N,imdbDisplay,\N,0
29048,tt0009088,5,Scars and Stripes,US,\N,working,\N,0
52588,tt0014444,1,Scars of Hate,\N,\N,original,\N,1
52589,tt0014444,2,Scars of Hate,US,\N,imdbDisplay,\N,0


In [18]:
# Now with the word Man
title_akas.loc[title_akas['title'].str.contains('man') == True].head(5)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
20,tt0000003,5,Sarmanul Pierrot,RO,\N,imdbDisplay,\N,0
499,tt0000075,7,The Conjuring of a Woman at the House of Rober...,US,\N,\N,\N,0
514,tt0000080,1,Grandes manoeuvres,\N,\N,original,\N,1
515,tt0000080,2,Grandes manoeuvres,FR,\N,imdbDisplay,\N,0
554,tt0000091,1,Le manoir du diable,\N,\N,original,\N,1


## title.basics.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. '\N' for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

In [24]:
title_basics = extract_inst.read_tsv(
    'title_basics.tsv'
)

In [26]:
# let's see all the types of videos we have in the titleType column
title_basics['titleType'].unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvEpisode', 'tvSeries',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

In [28]:
# im really curios about this video and videogame over here
title_basics.loc[title_basics['titleType'] == 'videoGame'].head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
82536,tt0084376,videoGame,"MysteryDisc: Murder, Anyone?","MysteryDisc: Murder, Anyone?",0,1982,\N,\N,"Adventure,Crime,Mystery"
84091,tt0085982,videoGame,MysteryDisc: Many Roads to Murder,MysteryDisc: Many Roads to Murder,0,1983,\N,\N,"Adventure,Crime,Mystery"
102647,tt0105000,videoGame,Night Trap,Night Trap,0,1992,\N,\N,"Adventure,Horror,Mystery"
107371,tt0109865,videoGame,Gabriel Knight: Sins of the Fathers,Gabriel Knight: Sins of the Fathers,0,1993,\N,\N,"Adventure,Drama,Horror"
107763,tt0110267,videoGame,King's Quest VII: The Princeless Bride,King's Quest VII: The Princeless Bride,0,1994,\N,\N,"Adventure,Fantasy"


In [29]:
title_basics.loc[title_basics['titleType'] == 'video'].head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
65094,tt0066435,video,Take It Out in Trade: The Outtakes,Take It Out in Trade: The Outtakes,0,1995,\N,69,Documentary
71782,tt0073303,video,Queen: Live at the Rainbow,Queen: Live at the Rainbow,0,1992,\N,83,"Documentary,Musical"
75545,tt0077178,video,Art Game,Art Game,0,1978,\N,12,Short
77324,tt0078999,video,Coriolanus,Coriolanus,0,1979,\N,160,"Drama,History"
77592,tt0079275,video,He Did It,He Did It,0,1979,\N,\N,\N


In [33]:
title_basics['titleType'].unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvEpisode', 'tvSeries',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

In [57]:
# okay so now we going to separate Long and short movies from tvseries and tvshows and from games and video
tvseries = ['tvEpisode', 'tvSeries','tvMiniSeries', 'tvSpecial','tvPilot','tvShort']
movies = ['movie','short','TvMovie']
video_games = ['video','videoGames']

# Using the Loc to filter all the dataframe
title_basics_tvseries = title_basics.loc[title_basics['titleType'].isin(tvseries)]
title_basics_movies = title_basics.loc[title_basics['titleType'].isin(movies)]
title_basics_videogames = title_basics.loc[title_basics['titleType'].isin(video_games)]

In [58]:
# Let's start cleaning the data movies
title_basics_movies.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [60]:
# drop the endYear column
title_basics_movies = title_basics_movies.drop(columns=['endYear']).copy()

In [64]:
# now split the genres 
# create three columns base_genre, genre1 and genre2
title_basics_movies[['base_genre','genre1','genre2']] = title_basics_movies['genres'].str.split(',',expand=True)

In [68]:
# change types of columns
title_basics_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1714661 entries, 0 to 11136777
Data columns (total 11 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   runtimeMinutes  object
 7   genres          object
 8   base_genre      object
 9   genre1          object
 10  genre2          object
dtypes: object(11)
memory usage: 157.0+ MB


In [79]:
# let's seen if we can convert easily the StartYearColumn
title_basics_movies['startYear'].unique()

array(['1894', '1892', '1893', '1895', '1896', '1898', '1897', '1900',
       '1899', '1901', '1902', '1903', '1905', '1904', '1912', '1907',
       '1906', '1908', '1910', '1909', '\\N', '1911', '1990', '1914',
       '1913', '1915', '1919', '1916', '1917', '1918', '1936', '1925',
       '1922', '1920', '1921', '1923', '1924', '1928', '2019', '2021',
       '1926', '1927', '1929', '2000', '1993', '1935', '1930', '1942',
       '1934', '1931', '1939', '1932', '1937', '1933', '1950', '1938',
       '1951', '1945', '1946', '1940', '1944', '1949', '1947', '1943',
       '1941', '1952', '1957', '1959', '1948', '2001', '1953', '1954',
       '1965', '1983', '1980', '1973', '1961', '1995', '1964', '1958',
       '1955', '1956', '1962', '1960', '1977', '2012', '1967', '1968',
       '2007', '1963', '1988', '1971', '1969', '1972', '1966', '1970',
       '2023', '1976', '2016', '2020', '1979', '1978', '1981', '2006',
       '1975', '2014', '1989', '1974', '1986', '1987', '2015', '2010',
       

In [None]:
# replace \\N
title_basics_movies

In [None]:
def replace(df,column,repr,repl):
    ''' Replace all values inside dataframe columns'''
    df[]

In [70]:
# Create a function to change_column type
def change_type(df,colum, type=['int64','str','float64']):
    df[colum] = df[colum].astype(
        type
    )

In [74]:
# Now create a list of all columns need to change types
to_int = ['isAdult','startYear','runtimeMinutes']

In [75]:
for int_ in to_int:
    change_type(title_basics_movies,int_,type='int64')

ValueError: invalid literal for int() with base 10: '\\N'

array(['1894', '1892', '1893', '1895', '1896', '1898', '1897', '1900',
       '1899', '1901', '1902', '1903', '1905', '1904', '1912', '1907',
       '1906', '1908', '1910', '1909', '\\N', '1911', '1990', '1914',
       '1913', '1915', '1919', '1916', '1917', '1918', '1936', '1925',
       '1922', '1920', '1921', '1923', '1924', '1928', '2019', '2021',
       '1926', '1927', '1929', '2000', '1993', '1935', '1930', '1942',
       '1934', '1931', '1939', '1932', '1937', '1933', '1950', '1938',
       '1951', '1945', '1946', '1940', '1944', '1949', '1947', '1943',
       '1941', '1952', '1957', '1959', '1948', '2001', '1953', '1954',
       '1965', '1983', '1980', '1973', '1961', '1995', '1964', '1958',
       '1955', '1956', '1962', '1960', '1977', '2012', '1967', '1968',
       '2007', '1963', '1988', '1971', '1969', '1972', '1966', '1970',
       '2023', '1976', '2016', '2020', '1979', '1978', '1981', '2006',
       '1975', '2014', '1989', '1974', '1986', '1987', '2015', '2010',
       

In [77]:
title_basics_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1714661 entries, 0 to 11136777
Data columns (total 11 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         int64 
 5   startYear       object
 6   runtimeMinutes  object
 7   genres          object
 8   base_genre      object
 9   genre1          object
 10  genre2          object
dtypes: int64(1), object(10)
memory usage: 157.0+ MB


In [67]:
title_basics_movies[title_basics_movies['isAdult'] == '1']

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres,base_genre,genre1,genre2
61185,tt0062417,movie,Un épais manteau de sang,Un épais manteau de sang,1,1968,88,Drama,Drama,,
61489,tt0062727,short,Of Special Merit,Besonders wertvoll,1,1968,11,"Adult,Short",Adult,Short,
62366,tt0063631,movie,Space Thing,Space Thing,1,1968,70,"Comedy,Sci-Fi",Comedy,Sci-Fi,
62773,tt0064057,movie,Bacchanales 69,Bacchanales 69,1,1969,95,\N,\N,,
63629,tt0064929,movie,The Amorous Headmaster,Sangen om den røde rubin,1,1970,107,"Comedy,Drama",Comedy,Drama,
...,...,...,...,...,...,...,...,...,...,...,...
11098756,tt9834162,short,Be a Hero,Be a Hero,1,2015,9,"Adult,Short",Adult,Short,
11098985,tt9834658,short,Entre dos sueños,Entre dos sueños,1,2010,21,"Adult,Short",Adult,Short,
11110851,tt9860530,short,Shower Time For The Girls,Shower Time For The Girls,1,2006,7,"Adult,Short",Adult,Short,
11125309,tt9892000,movie,The Secret Lives of Love Starved Housewives,The Secret Lives of Love Starved Housewives,1,\N,\N,"Adult,Fantasy",Adult,Fantasy,


title.crew.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
directors (array of nconsts) - director(s) of the given title
writers (array of nconsts) – writer(s) of the given title
title.episode.tsv.gz
tconst (string) - alphanumeric identifier of episode
parentTconst (string) - alphanumeric identifier of the parent TV Series
seasonNumber (integer) – season number the episode belongs to
episodeNumber (integer) – episode number of the tconst in the TV series
title.principals.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
nconst (string) - alphanumeric unique identifier of the name/person
category (string) - the category of job that person was in
job (string) - the specific job title if applicable, else '\N'
characters (string) - the name of the character played if applicable, else '\N'
title.ratings.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received
name.basics.tsv.gz
nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string)– name by which the person is most often credited
birthYear – in YYYY format
deathYear – in YYYY format if applicable, else '\N'
primaryProfession (array of strings)– the top-3 professions of the person
knownForTitles (array of tconsts) – titles the person is known for