---

### Notes:

1. **Importing Pandas Library**  
   - `import pandas as pd`: Loads the Pandas library, which is used for data manipulation and analysis.

2. **Reading the Dataset**  
   - `movies = pd.read_csv('/content/top10K-TMDB-movies.csv')`: Reads the dataset from the specified CSV file and stores it in the `movies` DataFrame.

3. **Previewing Data**  
   - `movies.head(10)`: Displays the first 10 rows of the dataset, giving a quick glance at the structure and content of the data.

4. **Dataset Information**  
   - `movies.info()`: Prints a concise summary of the DataFrame, including:
     - Column names
     - Data types
     - Non-null counts
     - Memory usage
   - Useful for understanding the dataset's structure.

5. **Checking for Missing Values**  
   - `movies.isnull().sum()`: Counts the missing (`NaN`/null) entries in each column.
   - Helps determine which columns need cleaning or imputation.

6. **Selecting Specific Columns**  
   - `movies = movies[['id', 'title', 'overview', 'genre']]`: Filters the DataFrame to retain only the specified columns: `id`, `title`, `overview`, and `genre`.
   - This simplifies the dataset, keeping only relevant information for analysis.
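
The inspection and column-selection steps above can be tried on a small in-memory frame. This is a minimal sketch: the `sample` frame and its values are made up for illustration and are not taken from the CSV. It also adds `.copy()` after the column selection, which is one common way to avoid pandas' `SettingWithCopyWarning` when a column is later assigned on the selected frame.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the notebook reads these columns from the CSV.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genre": ["Drama", np.nan, "Comedy"],
    "overview": ["First plot.", "Second plot.", np.nan],
    "popularity": [10.0, 20.0, 30.0],
})

# Count missing values per column.
missing = sample.isnull().sum()
print(missing)

# Keep only the columns needed later; .copy() makes the result an
# independent DataFrame rather than a slice of `sample`.
selected = sample[["id", "title", "overview", "genre"]].copy()
print(selected.columns.tolist())
```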

---

In [2]:
import pandas as pd


In [3]:
movies=pd.read_csv('/content/top10K-TMDB-movies.csv')
movies.head(10)

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811
5,667257,Impossible Things,"Family,Drama",es,"Matilde is a woman who, after the death of her...",14.358,2021-06-17,8.6,255
6,129,Spirited Away,"Animation,Family,Fantasy",ja,"A young girl, Chihiro, becomes trapped in a st...",92.056,2001-07-20,8.5,13093
7,730154,Your Eyes Tell,"Romance,Drama",ja,"A tragic accident lead to Kaori's blindness, b...",51.345,2020-10-23,8.5,339
8,372754,Dou kyu sei – Classmates,"Romance,Animation",ja,"Rihito Sajo, an honor student with a perfect s...",14.285,2016-02-20,8.5,239
9,372058,Your Name.,"Romance,Animation,Drama",ja,High schoolers Mitsuha and Taki are complete s...,158.27,2016-08-26,8.5,8895


In [4]:
print('Data Count: ')
movies.info()
print('\n')
print("Checking Missing values: ")
movies.isnull().sum()

Data Count: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.2+ KB


Checking Missing values: 


Unnamed: 0,0
id,0
title,0
genre,3
original_language,0
overview,13
popularity,0
release_date,0
vote_average,0
vote_count,0


In [5]:
movies = movies[['id', 'title','overview','genre']]

# Feature Selection Part

---

### Notes:

1. **Viewing Column Names**  
   - `movies.columns`: Returns a pandas `Index` object holding the column names of the `movies` DataFrame.
   - Useful to check the structure of the dataset and verify column names.

2. **Creating a New Column (`tags`)**  
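   - `movies['tags'] = movies['overview'] + movies['genre']`:
     - Concatenates the strings in the `overview` and `genre` columns into a new column called `tags`. No separator is inserted, so the last word of the overview runs directly into the first genre.
     - If either value is missing, the result is `NaN`, which is why `tags` has only 9985 non-null entries in the later `movies_new.info()` output.
     - Useful for creating a single column with consolidated information, such as for text analysis or recommendation systems.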

3. **Previewing the Modified DataFrame**  
   - `movies`: Displays the updated DataFrame, now including the new `tags` column.

4. **Dropping Unnecessary Columns**  
   - `movies_new = movies.drop(columns=['overview', 'genre'])`: Creates a new DataFrame (`movies_new`) by removing the `overview` and `genre` columns from `movies`.
   - Helps declutter the dataset by keeping only essential columns.

5. **Printing the Updated Dataset**  
   - `print('printing improved movies dataset')`: Adds context to the output to make it clear what is being displayed.
   - `movies_new`: Shows the final improved DataFrame.
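
A minimal sketch of the `tags` step on made-up rows, assuming one wants a space between the overview and the genre list and wants rows with a missing value to keep whatever text they do have. The `fillna("")` calls, the `" "` separator, and the `demo` frame are additions for illustration; the notebook cell itself uses plain `+`.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "overview": ["A heist goes wrong.", np.nan, "Two friends travel."],
    "genre": ["Crime,Drama", "Comedy", np.nan],
})

# Plain `+` glues the strings together and yields NaN if either side is NaN.
plain = demo["overview"] + demo["genre"]

# Filling NaN with "" and adding a space keeps every row usable.
demo["tags"] = (demo["overview"].fillna("") + " " + demo["genre"].fillna("")).str.strip()

# Drop the source columns once they have been merged into `tags`.
demo_new = demo.drop(columns=["overview", "genre"])
print(plain.tolist())
print(demo_new)
```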

---
This snippet demonstrates column manipulation and dataset refinement for further analysis.😊

In [6]:
movies.columns

Index(['id', 'title', 'overview', 'genre'], dtype='object')

In [7]:
movies['tags'] = movies['overview'] + movies['genre']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['tags'] = movies['overview'] + movies['genre']


In [8]:
movies

Unnamed: 0,id,title,overview,genre,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,"Drama,Crime",Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","Drama,Crime","Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...,"Drama,History,War",The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,"Drama,Crime",In the continuing saga of the Corleone crime f...
...,...,...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo...","Action,Adventure,Fantasy","The story follows the adventures of Aang, a yo..."
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...,"Action,TV Movie,Science Fiction,Comedy,Adventure",The sharks take bite out of the East Coast whe...
9997,13995,Captain America,"During World War II, a brave, patriotic Americ...","Action,Science Fiction,War","During World War II, a brave, patriotic Americ..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...,"Adventure,Fantasy,Action,Drama",A man named Farmer sets out to rescue his kidn...


In [9]:
movies_new = movies.drop(columns=['overview', 'genre'])
print('printing improved movies dataset')
movies_new

printing improved movies dataset


Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...
...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo..."
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...
9997,13995,Captain America,"During World War II, a brave, patriotic Americ..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...


# Text Cleaning (NLP Text Cleaning with NLTK)

---

### Notes:

#### 1. **Importing Libraries**  
   - **`nltk`**: A powerful library for natural language processing (NLP).  
   - **`re`**: Python's regular expressions library for text pattern matching.  
   - **Submodules:**
     - `nltk.corpus.stopwords`: Provides a list of common stop words in various languages.
     - `nltk.stem.WordNetLemmatizer`: Used for lemmatization (reducing words to their base or root form).
     - `nltk.tokenize.word_tokenize`: Tokenizes (splits) text into individual words.

#### 2. **Downloading Required NLTK Resources**  
   - `nltk.download('punkt')`: Downloads the tokenizer models.
   - `nltk.download('stopwords')`: Downloads the stopwords list.
   - `nltk.download('wordnet')`: Downloads the WordNet corpus required for lemmatization.
   - `nltk.download('punkt_tab')`: Downloads the Punkt tokenizer data in the format that newer NLTK releases load for `word_tokenize`.  

#### 3. **Defining the `clean_text` Function**  
   - **Purpose**: Prepares and cleans text data for NLP tasks by performing operations like lowercase conversion, punctuation removal, stopword removal, tokenization, and lemmatization.
   - **Steps:**
     1. **Handle Non-String or NaN Input**:  
        - Returns an empty string if the input is not valid.
     2. **Convert to Lowercase**: Ensures uniformity for processing.
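     3. **Remove Punctuation but Keep Digits**:  
        - `re.sub(r'[^\w\s\d]', '', text)`: Removes every character that is not a word character (letter, digit, or underscore) or whitespace. The `\d` is redundant because `\w` already matches digits. Hyphenated words are fused, so "happy-go-lucky" becomes "happygolucky".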
     4. **Tokenize Text**: Splits text into individual words for further processing.
     5. **Remove Stop Words**:  
        - Common words like "is," "the," "and" are removed using the predefined English stopwords list.
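     6. **Lemmatization**: Reduces words to their dictionary form. `lemmatize` is called without a part-of-speech argument, so every word is treated as a noun: "years" becomes "year", while verb forms such as "follows" are left unchanged.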
     7. **Join Words Back Together**: Combines cleaned tokens into a single string.
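
To see what the punctuation step does on its own, the sketch below applies the same pattern to a sample sentence using only the standard `re` module. The sample string is made up for illustration.

```python
import re

sample = "Raj is a happy-go-lucky, second_generation hero of the 1990s!"

# Same pattern as in clean_text: drop anything that is not a word
# character (letters, digits, underscore) or whitespace.
original = re.sub(r'[^\w\s\d]', '', sample.lower())

# \d is already covered by \w, so the shorter pattern gives the same result.
shorter = re.sub(r'[^\w\s]', '', sample.lower())

print(original)
print(original == shorter)
```

The hyphens, comma, and exclamation mark are removed, while the underscore and the digits survive.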

#### 4. **Applying the Function to Clean Data**  
   - `movies_new['tags_cleaned'] = movies_new['tags'].apply(clean_text)`:  
     - Applies the `clean_text` function to the `tags` column in the `movies_new` DataFrame.  
     - Creates a new column, `tags_cleaned`, containing the preprocessed text.

---

### Applications:
- **Text Preprocessing**: Essential for NLP tasks like sentiment analysis, topic modeling, or text classification.
- **Improved Data Quality**: The cleaned `tags_cleaned` column ensures consistency and removes noise from the dataset.

---

In [10]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#Download Necessary NLTK Resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Build the stop-word set and the lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Cleaning the text
def clean_text(text):
  # Non-string input (including NaN, which is a float) becomes an empty string
  if not isinstance(text, str):
    return ""
  # Convert to lowercase
  text = text.lower()
  # Remove punctuation but keep letters, digits, underscores and whitespace
  text = re.sub(r'[^\w\s\d]', '', text)
  # Tokenize
  words = word_tokenize(text)
  # Remove the stop words
  words = [word for word in words if word not in stop_words]
  # Lemmatize (the default part of speech is noun)
  words = [lemmatizer.lemmatize(word) for word in words]
  # Join the words back together
  cleaned_text = ' '.join(words)
  return cleaned_text


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [11]:
movies_new['tags_cleaned'] = movies_new['tags'].apply(clean_text)

# Count Vectorization

---

### Notes:

#### 1. **Importing Required Libraries**  
   - `from sklearn.feature_extraction.text import CountVectorizer`:  
     - Used to convert text data into numerical feature vectors based on word frequency.
   - `from sklearn.model_selection import train_test_split`:  
     - Splits the dataset into training and testing subsets for validation purposes.

#### 2. **Splitting Data**  
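   - `train_data, test_data = train_test_split(movies_new, test_size=0.2, random_state=42)`:  
     - Splits the `movies_new` DataFrame into 80% training data and 20% test data.  
     - `random_state=42` ensures reproducibility.
     - Neither `train_data` nor `test_data` is used in the cells that follow; the vectorizer and the similarity matrix are built from the full `movies_new` DataFrame.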

#### 3. **Text Vectorization**  
   - `cv = CountVectorizer(max_features=10000, stop_words='english')`:  
     - Initializes a `CountVectorizer` to create a bag-of-words representation of the cleaned tags.  
     - Limits the vocabulary to the 10,000 most frequent terms and applies scikit-learn's built-in English stop-word list, in addition to the NLTK stop words already removed by `clean_text`.
   - `vector = cv.fit_transform(movies_new['tags_cleaned'].values.astype('U')).toarray()`:  
     - Casts the `tags_cleaned` column to Unicode strings, builds a sparse matrix of token counts, and then converts it into a dense NumPy array.  
   - `vector.shape`: Returns the dimensions of the feature matrix, here `(10000, 10000)`: one row per movie and one column per vocabulary term.

#### 4. **Cosine Similarity Calculation**  
   - `from sklearn.metrics.pairwise import cosine_similarity`:  
     - Calculates the cosine similarity between rows in the feature matrix to measure the similarity between movie tags.  
   - `similarity = cosine_similarity(vector)`:  
     - Creates a similarity matrix where each entry represents the cosine similarity between two movies.
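
For intuition, cosine similarity between two count vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up vectors; the helper name `cosine` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors.

    Assumes neither vector is all zeros (the norm would be 0).
    """
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])   # shares one of two terms with a
c = np.array([2, 2, 0, 0])   # same direction as a, twice the length

print(cosine(a, b))   # approximately 0.5
print(cosine(a, c))   # approximately 1.0: length does not matter
```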

#### 5. **Inspecting Dataset**  
   - `movies_new.info()`: Displays the structure of the `movies_new` DataFrame.

#### 6. **Finding Similar Movies**  
   - `distance = sorted(list(enumerate(similarity[4])), reverse=True, key=lambda vector: vector[1])`:  
     - Ranks every movie by its cosine similarity to the 5th movie in the dataset (`index=4`, *The Godfather: Part II*).  
     - Sorts the similarity scores in descending order.
   - Loop:  
     - Prints the titles of the first 5 entries using `movies_new.iloc[i[0]].title`. The query movie has similarity 1.0 with itself, so it is printed first and the remaining four are the actual recommendations.
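
The ranking step can be tried on a small made-up similarity matrix. This sketch follows the same `sorted(enumerate(...))` pattern as the notebook cell, but starts the slice at 1 to skip the query item; that offset is an assumption about the desired behaviour, not what the cell does.

```python
import numpy as np

# Made-up 4x4 similarity matrix; the diagonal is each item's similarity to itself.
sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.4, 0.0, 1.0],
])

query = 0
# Sort (index, score) pairs by score, highest first.
ranked = sorted(enumerate(sim[query]), key=lambda pair: pair[1], reverse=True)

# ranked[0] is the query itself, so start the slice at 1.
top2 = [idx for idx, _ in ranked[1:3]]
print(ranked[0][0], top2)
```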

#### 7. **Recommender Function**  
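   - `def recommend_movies(movies)`:  
     - A function that takes a movie title as input and prints the five highest-scoring titles. The parameter is named `movies` but holds a single title string, not the `movies` DataFrame.  
   - Process:  
     1. Finds the index of the input movie title (`.index[0]` raises an `IndexError` if the title is not in `movies_new`).
     2. Looks up that movie's row in the precomputed `similarity` matrix.
     3. Sorts the scores in descending order.
     4. Prints the first five titles; as the example output shows, the input movie itself comes first.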
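
One possible variant of the recommender, sketched on made-up data: it returns a list instead of printing, skips the input title, and returns an empty list for an unknown title. The function name `recommend_titles`, its parameters, and the `catalog` and `sim` objects are hypothetical stand-ins for `movies_new` and `similarity`.

```python
import numpy as np
import pandas as pd

# Made-up catalogue and similarity matrix.
catalog = pd.DataFrame({"title": ["Alpha", "Beta", "Gamma", "Delta"]})
sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.4, 0.0, 1.0],
])

def recommend_titles(title, frame, sim_matrix, k=5):
    """Return up to k titles most similar to `title`, excluding the title itself."""
    matches = np.flatnonzero((frame["title"] == title).to_numpy())
    if matches.size == 0:
        return []                      # unknown title: nothing to recommend
    pos = int(matches[0])              # positional index of the first match
    order = np.argsort(sim_matrix[pos], kind="stable")[::-1]   # highest score first
    picks = [int(i) for i in order if i != pos][:k]
    return frame["title"].iloc[picks].tolist()

print(recommend_titles("Alpha", catalog, sim, k=2))
print(recommend_titles("Missing", catalog, sim))
```

Using the positional index keeps the lookup aligned with the rows of the similarity matrix even if the DataFrame index is not a plain `0..n-1` range.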

#### 8. **Usage Example**  
   - `recommend_movies("The Avengers")`:  
     - Recommends movies similar to *The Avengers* based on cosine similarity of tags.

---

### Applications:
- **Content-Based Recommendation System**:  
   - This approach recommends movies based on their text features (e.g., tags, genres, or overviews).  
- **Cosine Similarity**:  
   - Measures the cosine of the angle between two feature vectors. It compares the direction of the vectors rather than their magnitude, so a long overview and a short one can still score as similar.

---

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection  import train_test_split


In [13]:
train_data, test_data = train_test_split(movies_new, test_size=0.2, random_state=42)

In [14]:
cv = CountVectorizer(max_features=10000, stop_words='english')

In [15]:
vector = cv.fit_transform(movies_new['tags_cleaned'].values.astype('U')).toarray()

In [16]:
movies_new['tags']

Unnamed: 0,tags
0,Framed in the 1940s for the double murder of h...
1,"Raj is a rich, carefree, happy-go-lucky second..."
2,"Spanning the years 1945 to 1955, a chronicle o..."
3,The true story of how businessman Oskar Schind...
4,In the continuing saga of the Corleone crime f...
...,...
9995,"The story follows the adventures of Aang, a yo..."
9996,The sharks take bite out of the East Coast whe...
9997,"During World War II, a brave, patriotic Americ..."
9998,A man named Farmer sets out to rescue his kidn...


In [17]:
vector.shape

(10000, 10000)

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
similarity = cosine_similarity(vector)

In [20]:
similarity

array([[1.        , 0.03175003, 0.03035884, ..., 0.0862796 , 0.09274778,
        0.03745029],
       [0.03175003, 1.        , 0.05976143, ..., 0.        , 0.        ,
        0.        ],
       [0.03035884, 0.05976143, 1.        , ..., 0.        , 0.04364358,
        0.03524537],
       ...,
       [0.0862796 , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.09274778, 0.        , 0.04364358, ..., 0.        , 1.        ,
        0.        ],
       [0.03745029, 0.        , 0.03524537, ..., 0.        , 0.        ,
        1.        ]])

In [21]:
movies_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            10000 non-null  int64 
 1   title         10000 non-null  object
 2   tags          9985 non-null   object
 3   tags_cleaned  10000 non-null  object
dtypes: int64(1), object(3)
memory usage: 312.6+ KB


In [22]:
distance = sorted(list(enumerate(similarity[4])), reverse = True, key = lambda vector:vector[1])
for i in distance[0:5]:
  print(movies_new.iloc[i[0]].title)

The Godfather: Part II
The Godfather
The Godfather: Part III
Brooklyn
Kind Hearts and Coronets


In [23]:
def recommend_movies(movies):
  index = movies_new[movies_new['title'] == movies].index[0]
  distance = sorted(list(enumerate(similarity[index])), reverse = True, key = lambda vector:vector[1])
  for i in distance[0:5]:
    print(movies_new.iloc[i[0]].title)

In [24]:
recommend_movies("The Avengers")

The Avengers
Team America: World Police
Enemy Mine
Eternals
Living in Oblivion


---

### Notes:

#### 1. **Importing the `pickle` Library**  
   - `pickle`: A Python module used for serializing and deserializing Python objects.  
     - **Serialization**: Converts Python objects into a binary format that can be saved to a file.  
     - **Deserialization**: Converts binary data back into Python objects.

#### 2. **Saving Data with `pickle.dump`**  
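   - **Syntax**: `pickle.dump(object, file)`, where `file` is a file object opened in binary write mode, for example `open(filename, 'wb')`.  
     - Saves an object to a file in binary format. Passing `open(...)` inline, as the cells here do, leaves closing the file to the interpreter.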
   - `pickle.dump(movies_new, open('movies_list.pkl', 'wb'))`:  
     - Serializes the `movies_new` DataFrame and saves it to a file named `movies_list.pkl`.  
     - `wb`: Write-binary mode.
   - `pickle.dump(similarity, open('similarity.pkl', 'wb'))`:  
     - Serializes the `similarity` matrix and saves it to a file named `similarity.pkl`.

#### 3. **Loading Data with `pickle.load`**  
   - **Syntax**: `pickle.load(file)`, where `file` is a file object opened in binary read mode, for example `open(filename, 'rb')`.  
     - Reads a binary file and deserializes it back into a Python object.
   - `pickle.load(open('movies_list.pkl', 'rb'))`:  
     - Loads the serialized `movies_new` DataFrame from the `movies_list.pkl` file.  
     - `rb`: Read-binary mode.
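
A minimal sketch of the same save-and-load round trip using `with` blocks, so that each file is closed as soon as the block ends. The `payload` dictionary and the temporary file name are made up for illustration; the notebook pickles `movies_new` and `similarity` instead.

```python
import os
import pickle
import tempfile

# Small stand-in object for the DataFrame and similarity matrix.
payload = {"titles": ["Alpha", "Beta"], "scores": [[1.0, 0.2], [0.2, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")

    # The with-block closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```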

---

### Applications:
1. **Saving Models and Data**:  
   - `pickle` is commonly used to save trained machine learning models, data, and precomputed values (like similarity matrices) for later use.
   
2. **Portability**:  
   - The `.pkl` files can be reloaded in later sessions or on other systems that have compatible versions of Python, pandas, and NumPy installed.

3. **Efficiency**:  
   - Saves time by avoiding recomputation of large datasets or complex calculations.

---


In [25]:
import pickle

In [26]:
pickle.dump(movies_new, open('movies_list.pkl', 'wb'))
pickle.dump(similarity, open('similarity.pkl', 'wb'))


In [27]:
pickle.load(open('movies_list.pkl', 'rb'))

Unnamed: 0,id,title,tags,tags_cleaned
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,framed 1940s double murder wife lover upstandi...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...",raj rich carefree happygolucky second generati...
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",spanning year 1945 1955 chronicle fictional it...
3,424,Schindler's List,The true story of how businessman Oskar Schind...,true story businessman oskar schindler saved t...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,continuing saga corleone crime family young vi...
...,...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo...",story follows adventure aang young successor l...
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...,shark take bite east coast sharknado hit washi...
9997,13995,Captain America,"During World War II, a brave, patriotic Americ...",world war ii brave patriotic american soldier ...
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...,man named farmer set rescue kidnapped wife ave...
