## Dataset Description
This dataset contains information about 9,000+ movies extracted from the TMDB API.

## Columns Descriptions
1. `Release_Date`: Date when the movie was released.
2. `Title`: Name of the movie.
3. `Overview`: Brief summary of the movie.
4. `Popularity`: A metric computed by TMDB developers based on the number of views per day, votes per day, the number of users who marked the movie as "favorite" or added it to their "watchlist", the release date, and other factors.
5. `Vote_Count`: Total votes received from the viewers.
6. `Vote_Average`: Average rating out of 10, based on the vote count and the number of viewers.
7. `Original_Language`: Original language of the movie. Dubbed versions are not considered.
8. `Genre`: Categories the movie can be classified under.
9. `Poster_Url`: URL of the movie poster.

## EDA Questions
- Q1: What is the most frequent `genre` in the dataset?
- Q2: Which `genres` have the highest `votes`?
- Q3: Which movie has the highest `popularity`? What is its `genre`?
- Q4: Which year has the most filmed movies?
___

In [2]:
!pip install seaborn







In [3]:
# importing lib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# getting dataset file dir.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [4]:
def catigorize_col(df, col, labels):
    """
    Categorizes a column based on its quartiles.

    Args:
        df     (DataFrame) - dataframe we are processing
        col    (str)       - name of the column to categorize
        labels (list)      - list of labels from min to max

    Returns:
        df     (DataFrame) - dataframe with the categorized column
    """

    # computing the summary statistics once and using them as bin edges
    stats = df[col].describe()
    edges = [stats['min'], stats['25%'], stats['50%'], stats['75%'], stats['max']]

    # include_lowest=True keeps the minimum value inside the first bin
    df[col] = pd.cut(df[col], edges, labels=labels, duplicates='drop', include_lowest=True)
    return df
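
As a minimal sketch of the quartile-based binning described in the docstring, the following applies the same idea to a small toy frame. The column name `score`, the labels and the data are hypothetical and are not taken from the movie dataset. The sketch passes `include_lowest=True` so that the minimum value lands in the first bin instead of becoming `NaN`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset
toy = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
labels = ["low", "mid_low", "mid_high", "high"]

# Quartile edges: min, 25%, 50%, 75%, max
edges = toy["score"].quantile([0, 0.25, 0.5, 0.75, 1.0]).tolist()

# include_lowest=True keeps the minimum value inside the first bin
toy["score_cat"] = pd.cut(toy["score"], bins=edges, labels=labels, include_lowest=True)
print(toy)
```

With eight evenly spaced values, each of the four bins receives two rows.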

In [4]:
# Importing the pandas library
import pandas as pd

# Loading data and viewing its first 5 rows
df = pd.read_csv('mymoviedb.csv', lineterminator='\n')
df.head()


Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url\r
0,15-12-2021,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,01-03-2022,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,25-02-2022,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,24-11-2021,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,22-12-2021,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


In [5]:
# viewing dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9837 entries, 0 to 9836
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Release_Date       9837 non-null   object 
 1   Title              9828 non-null   object 
 2   Overview           9828 non-null   object 
 3   Popularity         9827 non-null   float64
 4   Vote_Count         9827 non-null   object 
 5   Vote_Average       9827 non-null   object 
 6   Original_Language  9827 non-null   object 
 7   Genre              9826 non-null   object 
 8   Poster_Url\r       9837 non-null   object 
dtypes: float64(1), object(8)
memory usage: 691.8+ KB


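- The non-null counts differ across columns (for example, `Title` has 9828 non-null entries out of 9837), so the dataset does contain some missing values.
- `Vote_Count` and `Vote_Average` are stored as `object` rather than as numeric types, which suggests that some rows are malformed.
- `Overview`, `Original_Language` and `Poster_Url` won't be very useful for the analysis.
- `Release_Date` needs to be cast to datetime so that only the year can be extracted.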
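
The non-null counts reported by `df.info()` can be cross-checked with an explicit count of missing values. The snippet below is only a sketch on a hypothetical three-row frame (the values are made up for illustration); the same two calls can be run on the real `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie data
toy = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Popularity": [10.5, np.nan, 7.1],
    "Genre": ["Drama", None, "Comedy, Family"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows in which every column is missing (typical of broken CSV lines)
all_missing_rows = toy[toy.isna().all(axis=1)]
print(len(all_missing_rows))
```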

In [6]:
df['Genre'].head()

0    Action, Adventure, Science Fiction
1              Crime, Mystery, Thriller
2                              Thriller
3    Animation, Comedy, Family, Fantasy
4      Action, Adventure, Thriller, War
Name: Genre, dtype: object

- Genres are separated by a comma followed by a space.

In [7]:
# checking for any duplicated rows in the dataset
df.duplicated().sum()

0

- Our dataset has no duplicated rows.

In [8]:
# exploring summary statistics
df.describe()

Unnamed: 0,Popularity
count,9827.0
mean,40.32057
std,108.874308
min,7.1
25%,16.1275
50%,21.191
75%,35.1745
max,5083.954


### Exploration Summary
- We have a dataframe consisting of 9837 rows and 9 columns.
- Our dataset looks fairly tidy: there are no duplicated rows, and only a handful of rows have missing values.
- `Release_Date` needs to be cast to datetime so that only the year can be extracted.
- `Overview`, `Original_Language` and `Poster_Url` won't be very useful for the analysis, so we'll drop them.
- There are noticeable outliers in the `Popularity` column.
- `Vote_Average` is better categorized for proper analysis.
- The `Genre` column has comma-separated values with whitespace that need to be handled, and it should be cast to category.
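
One common way to make the outlier observation concrete is the 1.5 × IQR rule. This is a sketch of that rule under the assumption that it is an acceptable definition of "outlier" here; the notebook itself does not say which rule it has in mind, and the values below are a hypothetical sample rather than the real column.

```python
import pandas as pd

# Hypothetical sample of popularity scores with one extreme value
popularity = pd.Series([7.1, 16.1, 21.2, 35.2, 40.0, 5083.9], name="Popularity")

# Interquartile range and the usual upper fence at Q3 + 1.5 * IQR
q1 = popularity.quantile(0.25)
q3 = popularity.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are flagged as outliers
outliers = popularity[popularity > upper_fence]
print(outliers)
```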
___

## Data Cleaning

**Casting `Release_Date` column and extracting year values**

In [9]:
# Convert column to datetime, invalid parsing will be set as NaT
df['Release_Date'] = pd.to_datetime(df['Release_Date'], errors='coerce', format='%d-%m-%Y')

# Print the result and check for NaT values
print(df['Release_Date'].dtypes)
print(df[df['Release_Date'].isna()])


datetime64[ns]
     Release_Date   Title Overview  Popularity Vote_Count Vote_Average  \
1106          NaT     NaN      NaN         NaN        NaN          NaN   
1107          NaT     NaN      NaN         NaN        NaN          NaN   
1108          NaT     NaN      NaN         NaN        NaN          NaN   
1109          NaT     NaN      NaN         NaN        NaN          NaN   
1110          NaT     NaN      NaN         NaN        NaN          NaN   
1111          NaT     NaN      NaN         NaN        NaN          NaN   
1112          NaT     NaN      NaN         NaN        NaN          NaN   
1113          NaT     NaN      NaN         NaN        NaN          NaN   
1114          NaT     NaN      NaN         NaN        NaN          NaN   
1115          NaT  61.328       35         7.1         en    Animation   

                                      Original_Language Genre Poster_Url\r  
1106                                                NaN   NaN           \r  
1107            

In [10]:
# casting column to datetime (it is already datetime64 after the previous cell)
df['Release_Date'] = pd.to_datetime(df['Release_Date'])

# confirming changes
print(df['Release_Date'].dtypes)

datetime64[ns]


In [11]:
df['Release_Date'] = df['Release_Date'].dt.year
df['Release_Date'].dtypes

dtype('float64')
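
The year comes back as `float64` because the column still contains `NaT` entries, and pandas represents the resulting missing years as `NaN`. A minimal sketch of two ways to get integer years, using a hypothetical three-row frame (the third value is deliberately unparseable):

```python
import pandas as pd

# Hypothetical toy data; the last entry cannot be parsed as a date
toy = pd.DataFrame({"Release_Date": ["15-12-2021", "01-03-2022", "not a date"]})
toy["Release_Date"] = pd.to_datetime(toy["Release_Date"], format="%d-%m-%Y", errors="coerce")

# Option 1: the nullable integer dtype keeps the missing entry as <NA>
years_nullable = toy["Release_Date"].dt.year.astype("Int64")

# Option 2: drop the unparseable rows first, then cast to a plain integer
years_int = toy.dropna(subset=["Release_Date"])["Release_Date"].dt.year.astype(int)

print(years_nullable)
print(years_int)
```

Which option fits better depends on whether the malformed rows should stay in the dataframe for later inspection.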

In [12]:
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url\r
0,2021.0,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022.0,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022.0,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021.0,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021.0,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


___
**Dropping `Overview`, `Original_Language` and `Poster_Url`**

In [13]:
# Define the list of columns to be dropped
cols = ['Overview', 'Original_Language', 'Poster_Url']

# Drop columns, ignoring errors if any column is not found
df.drop(cols, axis=1, inplace=True, errors='ignore')

# Confirm changes
df.columns


Index(['Release_Date', 'Title', 'Popularity', 'Vote_Count', 'Vote_Average',
       'Genre', 'Poster_Url\r'],
      dtype='object')
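
Note that the output still lists `'Poster_Url\r'`: the column name carries a trailing carriage return, so `'Poster_Url'` does not match it, and `errors='ignore'` hides the mismatch. A minimal sketch of one way to handle this, assuming it is acceptable to strip whitespace from all column names (the toy frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame whose last column name ends with a carriage return
toy = pd.DataFrame({"Title": ["A"], "Poster_Url\r": ["poster-a\r"]})

# With errors='ignore' the unmatched name is skipped silently
toy_unchanged = toy.drop(["Poster_Url"], axis=1, errors="ignore")
print(list(toy_unchanged.columns))

# Strip surrounding whitespace (including '\r') from every column name, then drop
cleaned = toy.rename(columns=lambda name: name.strip())
cleaned = cleaned.drop(["Poster_Url"], axis=1)
print(list(cleaned.columns))
```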

___
**Categorizing `Vote_Average` column**

We cut the `Vote_Average` values into 4 categories, `popular`, `average`, `below_avg` and `not_popular`, using a `categorize_col()` helper. The cell below uses fixed bin edges rather than the quartile-based `catigorize_col()` function defined earlier.

In [15]:
import pandas as pd

# Define edges (bins) and labels
bins = [0, 2.5, 5, 7.5, 10]  # Adjust edges according to your data
labels = ['not_popular', 'below_avg', 'average', 'popular']

# Function to categorize column based on bins and labels
def categorize_col(df, col, labels, bins):
    df[col] = pd.cut(df[col], bins=bins, labels=labels, include_lowest=True)

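# Example DataFrame
# NOTE: this assignment replaces the movie dataframe `df` with a 5-row example,
# so the columns loaded from the CSV are no longer available afterwards
data = {'Vote_Average': [1.0, 3.0, 5.5, 7.0, 9.5]}
df = pd.DataFrame(data)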

# Categorize 'Vote_Average' column
categorize_col(df, 'Vote_Average', labels, bins)

# Confirm changes
print(df)
print(df['Vote_Average'].unique())


  Vote_Average
0  not_popular
1    below_avg
2      average
3      average
4      popular
['not_popular', 'below_avg', 'average', 'popular']
Categories (4, object): ['not_popular' < 'below_avg' < 'average' < 'popular']
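
To apply the same fixed bins to the movie data itself, the column would first have to be numeric, since `df.info()` reports `Vote_Average` as `object`. The following is a sketch under that assumption; the frame `movies`, its values and the new column name `Vote_Category` are hypothetical, and writing to a separate column is a design choice made here so that the original ratings are kept.

```python
import pandas as pd

# Hypothetical stand-in for the movie data; strings mimic the object dtype,
# and "en" mimics a value that slipped in from a malformed row
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Vote_Average": ["8.3", "6.3", "en", "2.0"],
})

bins = [0, 2.5, 5, 7.5, 10]
labels = ["not_popular", "below_avg", "average", "popular"]

# Non-numeric entries become NaN instead of raising an error
movies["Vote_Average"] = pd.to_numeric(movies["Vote_Average"], errors="coerce")

# Bin into a new column so the numeric ratings remain available
movies["Vote_Category"] = pd.cut(
    movies["Vote_Average"], bins=bins, labels=labels, include_lowest=True
)
print(movies)
```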


In [17]:
# exploring column
df['Vote_Average'].value_counts()

Vote_Average
average        2
not_popular    1
below_avg      1
popular        1
Name: count, dtype: int64

In [18]:
# dropping NaNs
df.dropna(inplace = True)

# confirming
df.isna().sum()

Vote_Average    0
dtype: int64

___
**Handling `Genre` column's comma-separated values**

### TODO
For this challenging column, we chose an approach that stacks the genres into a dataframe and then merges it with our original dataframe. We explain further in the next cells.

In [19]:
# creating a new dataframe that holds all genres for each movie
#genres_df = df['Genre'].str.split(", ", expand=True)

# viewing its head
#genres_df.head()

Now that our dataframe of genres is done, we move on to stacking it, so that every movie is represented by a stack of genres.

In [20]:
# stacking genres dataframe
#genres_df = genres_df.stack()

# configuring it as pandas dataframe
#genres_df = pd.DataFrame(genres_df)

# viewing its first 10 rows
#genres_df.head(10)

In [21]:
#Renaming the genres column and confirming value count
#genres_df.rename(columns={0:'genres_stack'}, inplace=True)
#genres_df.genres_stack.value_counts()

Now that we have successfully created a new dataframe containing a stack of all movies' genres, we move on to merging it with the original dataframe.
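
The merge step is not shown in the cells above, so here is a minimal sketch of how the stack-and-merge approach could look end to end. It is an illustration under assumptions rather than the notebook's own code: the two-row frame `movies` is a stand-in built from titles and genres shown earlier, and `merged` is a hypothetical name.

```python
import pandas as pd

# Small stand-in for the movie dataframe
movies = pd.DataFrame({
    "Title": ["No Exit", "The Batman"],
    "Genre": ["Thriller", "Crime, Mystery, Thriller"],
})

# One column per genre position; shorter lists are padded with missing values
genres_wide = movies["Genre"].str.split(", ", expand=True)

# One row per (movie, genre position); dropna() removes any padding that remains
genres_stack = genres_wide.stack().dropna().rename("genres_stack")

# Drop the inner position level so the index lines up with the movie index
genres_stack = genres_stack.reset_index(level=1, drop=True)

# Join on the index: each movie row is repeated once per genre
merged = movies.drop(columns="Genre").join(genres_stack)
print(merged)
```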
___

### We split genres into a list and then explode our dataframe to have only one genre per row for each movie

In [23]:
df['Genre'] = df['Genre'].str.split(', ')
df = df.explode('Genre').reset_index(drop=True)
df.head()


KeyError: 'Genre'
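
The `KeyError` is consistent with `df` having been reassigned in cell 15 to the five-row example frame, which has only a `Vote_Average` column. The sketch below shows the split-and-explode step on a frame that does have a `Genre` column; `movies` is a hypothetical stand-in, and the final cast to `category` follows the plan stated in the exploration summary.

```python
import pandas as pd

# Small stand-in for the movie dataframe
movies = pd.DataFrame({
    "Title": ["No Exit", "The Batman"],
    "Genre": ["Thriller", "Crime, Mystery, Thriller"],
})

# Fail early with a clear message if the column is missing
if "Genre" not in movies.columns:
    raise KeyError("'Genre' column is missing; reload the movie data first")

# Split each genre string into a list, then emit one row per list element
movies["Genre"] = movies["Genre"].str.split(", ")
movies = movies.explode("Genre").reset_index(drop=True)

# Cast to category once every row holds a single genre
movies["Genre"] = movies["Genre"].astype("category")
print(movies)
```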