### Objective
I will clean and explore this dataset using pandas and visualise it using matplotlib. Afterwards, I'll hypothesise about the relationships and trends in the data. The dataset was obtained from [kaggle.com](https://www.kaggle.com/datasets/bharatnatrayn/movies-dataset-for-feature-extracion-prediction/data).


#### Importing Libraries

In [316]:
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt

#### Import Raw Dataset

In [317]:
df = pd.read_csv('Data/raw_movies.csv')
display(df.head())

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\r\nAction, Horror, Thriller",6.1,\r\nA woman with a mysterious illness is force...,\r\n Director:\r\nPeter Thorwarth\r\n| \r\n...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\r\nAnimation, Action, Adventure",5.0,\r\nThe war for Eternia begins again in what m...,"\r\n \r\n Stars:\r\nChris Wood, ...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\r\nDrama, Horror, Thriller",8.2,\r\nSheriff Deputy Rick Grimes wakes up from a...,\r\n \r\n Stars:\r\nAndrew Linco...,885805.0,44.0,
3,Rick and Morty,(2013– ),"\r\nAnimation, Adventure, Comedy",9.2,\r\nAn animated series that follows the exploi...,\r\n \r\n Stars:\r\nJustin Roila...,414849.0,23.0,
4,Army of Thieves,(2021),"\r\nAction, Crime, Horror",,"\r\nA prequel, set before the events of Army o...",\r\n Director:\r\nMatthias Schweighöfer\r\n...,,,


### Data Inspection

In [318]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MOVIES    9999 non-null   object 
 1   YEAR      9355 non-null   object 
 2   GENRE     9919 non-null   object 
 3   RATING    8179 non-null   float64
 4   ONE-LINE  9999 non-null   object 
 5   STARS     9999 non-null   object 
 6   VOTES     8179 non-null   object 
 7   RunTime   7041 non-null   float64
 8   Gross     460 non-null    object 
dtypes: float64(2), object(7)
memory usage: 703.2+ KB


(9999, 9)

In [319]:
df.describe()

Unnamed: 0,RATING,RunTime
count,8179.0,7041.0
mean,6.921176,68.688539
std,1.220232,47.258056
min,1.1,1.0
25%,6.2,36.0
50%,7.1,60.0
75%,7.8,95.0
max,9.9,853.0


In [320]:
df.isnull().sum()

MOVIES         0
YEAR         644
GENRE         80
RATING      1820
ONE-LINE       0
STARS          0
VOTES       1820
RunTime     2958
Gross       9539
dtype: int64

#### Observations
- The dataset has 9 columns and 9999 rows.
- Most columns (7 of 9) have dtype object.
- RATING and RunTime are float64.
- Columns with missing values: YEAR, GENRE, RATING, VOTES, RunTime, Gross
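
The missing-value counts are easier to compare as a share of all rows. The sketch below reuses the counts and the row total reported above; the variable names are illustrative only.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output above
missing = pd.Series({
    'MOVIES': 0, 'YEAR': 644, 'GENRE': 80, 'RATING': 1820, 'ONE-LINE': 0,
    'STARS': 0, 'VOTES': 1820, 'RunTime': 2958, 'Gross': 9539,
})
total_rows = 9999  # from df.shape

# Percentage of rows with a missing value, per column
missing_pct = (missing / total_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On the loaded frame itself, `df.isnull().mean() * 100` should give the same percentages directly.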

### Renaming Columns

In [321]:
df_col = df.rename(columns=
          {'MOVIES': 'Movies', 'YEAR': 'Year',
           'GENRE': 'Genre', 'RATING': 'Rating',
           'ONE-LINE': 'Short Desc', 'STARS': 'Stars',
           'VOTES': 'Votes', 'RunTime': 'Run Time',
           'Gross': 'Gross'})
print(df_col.columns)

Index(['Movies', 'Year', 'Genre', 'Rating', 'Short Desc', 'Stars', 'Votes',
       'Run Time', 'Gross'],
      dtype='object')


### Removing Duplicates

In [322]:
print(f'Duplicates: {df_col.duplicated().sum()}')
df_dupe = df_col.drop_duplicates()

Duplicates: 431


### Fixing Data Types
Votes: Remove ',' and convert to float.

Gross: Remove '$' and 'M' and convert to float.

In [323]:
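# regex=False treats ',', '$' and 'M' as literal characters on every pandas version
df_dupe.loc[:, 'Votes'] = df_dupe['Votes'].str.replace(',', '', regex=False).astype(float)
df_dupe.loc[:, 'Gross'] = df_dupe['Gross'].str.replace('$', '', regex=False).str.replace('M', '', regex=False).astype(float)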
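
The `info()` output that follows still reports `object` for Votes and Gross, which suggests the assignment through `.loc[:, col]` kept the original column dtype. A minimal sketch of one alternative is to work on an explicit copy and assign the converted Series by column name. The toy strings below are illustrative assumptions about the raw format, not values taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the raw columns (values are illustrative assumptions)
raw = pd.DataFrame({
    'Votes': ['21,062', '885,805', None],
    'Gross': ['$0.03M', '$75.47M', None],
})

clean = raw.copy()  # independent copy, so plain column assignment is safe

# Strip the thousands separator, then parse to numbers (missing values become NaN)
clean['Votes'] = pd.to_numeric(clean['Votes'].str.replace(',', '', regex=False))

# Strip the currency symbol and the 'M' suffix, then parse to numbers
clean['Gross'] = pd.to_numeric(
    clean['Gross'].str.replace('$', '', regex=False).str.replace('M', '', regex=False)
)

print(clean.dtypes)
```

Assigning a whole column by name replaces it, so the numeric dtype of the converted Series carries over.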

In [324]:
df_fix = df_dupe
df_fix.info()

filtered_df_fix = df_fix[df_fix['Gross'].notnull()]
display(df_fix.head(3))

<class 'pandas.core.frame.DataFrame'>
Index: 9568 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Movies      9568 non-null   object 
 1   Year        9026 non-null   object 
 2   Genre       9490 non-null   object 
 3   Rating      8168 non-null   float64
 4   Short Desc  9568 non-null   object 
 5   Stars       9568 non-null   object 
 6   Votes       8168 non-null   object 
 7   Run Time    7008 non-null   float64
 8   Gross       460 non-null    object 
dtypes: float64(2), object(7)
memory usage: 747.5+ KB


Unnamed: 0,Movies,Year,Genre,Rating,Short Desc,Stars,Votes,Run Time,Gross
0,Blood Red Sky,(2021),"\r\nAction, Horror, Thriller",6.1,\r\nA woman with a mysterious illness is force...,\r\n Director:\r\nPeter Thorwarth\r\n| \r\n...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\r\nAnimation, Action, Adventure",5.0,\r\nThe war for Eternia begins again in what m...,"\r\n \r\n Stars:\r\nChris Wood, ...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\r\nDrama, Horror, Thriller",8.2,\r\nSheriff Deputy Rick Grimes wakes up from a...,\r\n \r\n Stars:\r\nAndrew Linco...,885805.0,44.0,


### Handling Missing Values
Gross and Run Time matter least to the average viewer, so rows missing either value are kept.

Missing Rating and Votes values are replaced with the column mean across the movies in the dataset.

All rows with a missing Year are removed.

In [325]:
# Drop rows with missing Year
df_mis = df_fix.dropna(subset=['Year'])

# Fill missing Ratings and Votes with mean
df_mis.loc[:, 'Rating'] = df_mis['Rating'].fillna(df_mis['Rating'].mean())
df_mis.loc[:, 'Votes'] = df_mis['Votes'].fillna(df_mis['Votes'].mean())



In [326]:
df_mis.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9026 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Movies      9026 non-null   object 
 1   Year        9026 non-null   object 
 2   Genre       8989 non-null   object 
 3   Rating      9026 non-null   float64
 4   Short Desc  9026 non-null   object 
 5   Stars       9026 non-null   object 
 6   Votes       9026 non-null   object 
 7   Run Time    6987 non-null   float64
 8   Gross       460 non-null    object 
dtypes: float64(2), object(7)
memory usage: 705.2+ KB
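
The drop-then-fill step can be sketched on a small frame. The values below are made up for illustration, and taking an explicit `.copy()` after `dropna` is an assumption about how one might avoid assignment warnings, not a description of the run above.

```python
import pandas as pd

# Toy data with one missing value per column (illustrative only)
toy = pd.DataFrame({
    'Year': ['(2021)', None, '(2010–2022)', '(2013– )'],
    'Rating': [6.0, 7.0, None, 8.0],
    'Votes': [100.0, 50.0, 300.0, None],
})

# Drop rows with a missing Year, then work on an independent copy
filled = toy.dropna(subset=['Year']).copy()

# Replace missing Rating and Votes values with the mean of the remaining rows
for col in ['Rating', 'Votes']:
    filled[col] = filled[col].fillna(filled[col].mean())

print(filled)
```

Note that the mean is computed after the rows without a Year are dropped, so those rows do not influence the fill value. The mean is also sensitive to a few very large vote counts, so the median may be worth comparing.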


### Formatting Text Data
Stars: Convert each string into a dictionary:
- 'Director' & 'Stars' are keys.
- Names are stored in a list.

Example: {'Director': ['Person A', ..., 'Person N'], 'Stars': ['Person C', ... , 'Person M']}

- Genre: Remove \r\n and convert into a list.
- Short Desc: Remove \r\n.
- Rating: Round to 2 decimal places.
- Year: Convert into a tuple (StartYear, EndYear).
- Split the Stars column into two: Director & Stars.

In [327]:
def clean_genre(text):
    # Remove line breaks, split on commas, and trim the spaces around each genre
    return [genre.strip() for genre in text.replace('\r\n', '').split(',')]

In [328]:
def clean_short_desc(text):
    return text.strip().replace('\r\n', '')

In [329]:
def clean_rating(val):
    return round(val, 2)

In [330]:
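def clean_year(val):
    # Drop the parentheses, then split on the en dash used in ranges such as '(2010–2022)'
    parts = [part.strip() for part in val.replace('(', '').replace(')', '').split('–')]
    start = int(parts[0])
    # A single year such as '(2021)' or an open range such as '(2021– )' has no end year
    end = int(parts[1]) if len(parts) > 1 and parts[1] else None
    return (start, end)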
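
Genre still has missing entries at this point, and calling a string method such as `.strip()` on a missing value raises an error. A minimal sketch of applying a cleaner while skipping missing values uses `Series.map` with `na_action='ignore'`. The helper name `split_genres` and the sample rows are illustrative assumptions.

```python
import pandas as pd

def split_genres(text):
    # Hypothetical helper: drop line breaks, split on commas, trim each name
    return [part.strip() for part in text.replace('\r\n', '').split(',')]

# Sample values, including one missing entry
genres = pd.Series(['\r\nAction, Horror, Thriller', None, '\r\nDrama'])

# na_action='ignore' leaves missing values untouched instead of calling the function
cleaned = genres.map(split_genres, na_action='ignore')
print(cleaned.tolist())
```

The same pattern should work for the other text cleaners, since each takes one value and returns one value.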

In [331]:
def clean_stars(text):
    # Initialize the result dictionary (named `credits` to avoid shadowing the built-in `dict`)
    credits = {'Director': [], 'Stars': []}

    # Remove line breaks, all spaces, and the 'Director:' / 'Stars:' labels
    removed_whitespace = text.strip().replace('\r\n', '').replace(' ', '')
    removed_labels = removed_whitespace.replace('Director:', '').replace('Stars:', '')

    # Split into the Director and Stars segments
    split_labels = removed_labels.split('|')

    # Split each segment into a list of names
    if len(split_labels) == 2:
        credits.update({'Director': split_labels[0], 'Stars': split_labels[1]})
        for key in credits:
            credits[key] = credits[key].split(',')
    else:
        # Only one segment: it is stored under 'Director' as a single string
        credits.update({'Director': split_labels[0], 'Stars': None})

    return credits

In [335]:
test_text = """    Director:\r\nPeter Thorwarth\r\n| \r\n    Stars:\r\nPeri Baumeister, \r\nCarl Anton Koch, \r\nAlexander Scheer, \r\nKais Setti\r\n"""
print(clean_stars(test_text))

{'Director': ['PeterThorwarth'], 'Stars': ['PeriBaumeister', 'CarlAntonKoch', 'AlexanderScheer', 'KaisSetti']}


In [332]:
# {'Director': ['Person A', ..., 'Person N'], 'Stars': ['Person C', ... , 'Person M']}
df_mis['Director'] = df_mis['Stars'].apply(clean_stars)
display(df_mis.head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_mis['Director'] = df_mis['Stars'].apply(clean_stars)


Unnamed: 0,Movies,Year,Genre,Rating,Short Desc,Stars,Votes,Run Time,Gross,Director
0,Blood Red Sky,(2021),"\r\nAction, Horror, Thriller",6.1,\r\nA woman with a mysterious illness is force...,\r\n Director:\r\nPeter Thorwarth\r\n| \r\n...,21062.0,121.0,,"{'Director': ['PeterThorwarth'], 'Stars': ['Pe..."
1,Masters of the Universe: Revelation,(2021– ),"\r\nAnimation, Action, Adventure",5.0,\r\nThe war for Eternia begins again in what m...,"\r\n \r\n Stars:\r\nChris Wood, ...",17870.0,25.0,,"{'Director': 'ChrisWood,SarahMichelleGellar,Le..."
2,The Walking Dead,(2010–2022),"\r\nDrama, Horror, Thriller",8.2,\r\nSheriff Deputy Rick Grimes wakes up from a...,\r\n \r\n Stars:\r\nAndrew Linco...,885805.0,44.0,,"{'Director': 'AndrewLincoln,NormanReedus,Melis..."
3,Rick and Morty,(2013– ),"\r\nAnimation, Adventure, Comedy",9.2,\r\nAn animated series that follows the exploi...,\r\n \r\n Stars:\r\nJustin Roila...,414849.0,23.0,,"{'Director': 'JustinRoiland,ChrisParnell,Spenc..."
4,Army of Thieves,(2021),"\r\nAction, Crime, Horror",6.919699,"\r\nA prequel, set before the events of Army o...",\r\n Director:\r\nMatthias Schweighöfer\r\n...,15144.414422,,,"{'Director': ['MatthiasSchweighöfer'], 'Stars'..."


In [333]:
df_mis.loc[:, 'Stars'] = df_mis['Stars'].apply(clean_stars)

In [334]:
display(df_mis.head())

Unnamed: 0,Movies,Year,Genre,Rating,Short Desc,Stars,Votes,Run Time,Gross,Director
0,Blood Red Sky,(2021),"\r\nAction, Horror, Thriller",6.1,\r\nA woman with a mysterious illness is force...,"{'Director': ['PeterThorwarth'], 'Stars': ['Pe...",21062.0,121.0,,"{'Director': ['PeterThorwarth'], 'Stars': ['Pe..."
1,Masters of the Universe: Revelation,(2021– ),"\r\nAnimation, Action, Adventure",5.0,\r\nThe war for Eternia begins again in what m...,"{'Director': 'ChrisWood,SarahMichelleGellar,Le...",17870.0,25.0,,"{'Director': 'ChrisWood,SarahMichelleGellar,Le..."
2,The Walking Dead,(2010–2022),"\r\nDrama, Horror, Thriller",8.2,\r\nSheriff Deputy Rick Grimes wakes up from a...,"{'Director': 'AndrewLincoln,NormanReedus,Melis...",885805.0,44.0,,"{'Director': 'AndrewLincoln,NormanReedus,Melis..."
3,Rick and Morty,(2013– ),"\r\nAnimation, Adventure, Comedy",9.2,\r\nAn animated series that follows the exploi...,"{'Director': 'JustinRoiland,ChrisParnell,Spenc...",414849.0,23.0,,"{'Director': 'JustinRoiland,ChrisParnell,Spenc..."
4,Army of Thieves,(2021),"\r\nAction, Crime, Horror",6.919699,"\r\nA prequel, set before the events of Army o...","{'Director': ['MatthiasSchweighöfer'], 'Stars'...",15144.414422,,,"{'Director': ['MatthiasSchweighöfer'], 'Stars'..."
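
In the output above, names lose their internal spaces (`'PeterThorwarth'`), and rows that list only stars end up under the `'Director'` key. A minimal sketch of one way to avoid both follows, and it also splits the result into two columns. It assumes every segment between `|` characters starts with a `Director:` or `Stars:` style label. The function name `parse_credits`, that label assumption, and the shortened sample strings are for illustration only.

```python
import pandas as pd

def parse_credits(text):
    # Hypothetical variant: routes each '|'-separated segment by its label
    credits = {'Director': [], 'Stars': []}
    for segment in text.replace('\r\n', '').split('|'):
        label, _, names = segment.partition(':')
        label = label.strip()
        # Trim each name but keep the spaces inside it
        people = [name.strip() for name in names.split(',') if name.strip()]
        if label.startswith('Director'):
            credits['Director'] = people
        elif label.startswith('Star'):
            credits['Stars'] = people
    return credits

# Shortened sample strings in the same layout as the Stars column
sample = pd.DataFrame({'Stars': [
    '\r\n    Director:\r\nPeter Thorwarth\r\n| \r\n    Stars:\r\nPeri Baumeister, \r\nCarl Anton Koch\r\n',
    '\r\n            \r\n    Stars:\r\nChris Wood, \r\nSarah Michelle Gellar\r\n',
]})

parsed = sample['Stars'].apply(parse_credits)

# Build a new frame with separate Director and Stars columns
result = sample.assign(
    Director=parsed.map(lambda credits: credits['Director']),
    Stars=parsed.map(lambda credits: credits['Stars']),
)
print(result)
```

Using `assign` returns a new frame rather than writing into a slice, which is one way to sidestep the copy warning shown earlier.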


### Handling Outliers

### Filtering Irrelevant Data

### Validating Data Consistency