# Data Cleaning Notebook

## Introduction
This Jupyter Notebook is dedicated to the crucial task of data cleaning. Data cleaning is an essential step in the data analysis process, ensuring that the dataset is accurate, consistent, and ready for further analysis. In this notebook, we will explore various data cleaning techniques to handle missing values, outliers, and inconsistencies in the dataset.

## Objective
The main objectives of this notebook are as follows:
- Identify and handle missing data.
- Detect and address outliers.
- Standardize and clean data formats.
- Address any inconsistencies or errors in the dataset.

## Dataset
For this notebook, we work with the IMDb dataset, specifically the table of person names. It contains the columns `nconst`, `primaryName`, `birthYear`, `deathYear`, `primaryProfession`, and `knownForTitles`, and is sourced from Kaggle. Cleaning this dataset is crucial for accurate and meaningful analysis.

## Methods
The data cleaning process will involve the following steps:
1. **Handling Missing Data:** Identifying and addressing missing values in the dataset.
2. **Outlier Detection and Treatment:** Identifying outliers and deciding on the appropriate treatment.
3. **Data Standardization:** Ensuring consistent formats for data types, such as dates and categorical variables.
4. **Error Correction:** Addressing any inconsistencies or errors in the dataset.
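
As a rough illustration of step 2, the sketch below flags outliers in a year-like column with the interquartile-range (IQR) rule. The sample values and the 1.5 multiplier are assumptions chosen for illustration; they are not taken from the IMDb data, and a different treatment may suit the real columns better.

```python
import pandas as pd

# Hypothetical birth years; 4 and 2150 are deliberately implausible.
years = pd.Series([1899, 1924, 1934, 1949, 1918, 4, 2150])

# Quartiles and interquartile range of the sample.
q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Conventional IQR fences: anything outside them is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = years[(years < lower) | (years > upper)]
print(outliers.tolist())
```

Flagged values still need a decision (drop, cap, or set to missing); the rule only points at candidates.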

## Code Explanation
The code blocks in this notebook will demonstrate the application of various data cleaning techniques. Each step will be accompanied by comments and explanations to enhance readability.

## Results
The results of the data cleaning process will be reflected in an improved, cleaned dataset. Visualizations and summary statistics may be used to showcase the impact of the cleaning process on the data quality.

## Conclusion
Data cleaning lays the foundation for robust and reliable data analysis. A thoroughly cleaned dataset ensures that subsequent analyses are based on accurate and consistent information. This notebook concludes with a clean and refined dataset ready for further exploration and modeling.

## References
No external references are used in this notebook.

## Acknowledgments
No external contributions are acknowledged in this notebook.

In [16]:
# Import the libraries needed for data cleaning
import pandas as pd
import numpy as np

In [17]:
# Load the tab-separated data set; lines that cannot be parsed are skipped
name_Dataset=pd.read_csv("C:/Users/Mohammed/Downloads/IMDB/Movie_Data/name.basics.tsv/data.tsv",sep="\t",on_bad_lines='skip')
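
IMDb's TSV files mark a missing field with the literal string `\N`. One option, sketched here under assumptions, is to let `read_csv` turn that marker into NaN at load time through `na_values`. The sketch reads a small in-memory sample (hypothetical rows, not the real file), so it does not alter any output recorded in this notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for name.basics.tsv (hypothetical sample rows).
sample_tsv = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm0000001\tFred Astaire\t1899\t1987\tsoundtrack,actor\ttt0072308,tt0053137\n"
    "nm0000003\tBrigitte Bardot\t1934\t\\N\tactress,soundtrack\ttt0056404,tt0054452\n"
)

# na_values="\\N" tells pandas to treat the two-character string \N as missing.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(sample["deathYear"].isna().sum())
```

With the marker parsed this way, `isna()` reports the missing years directly instead of counting them as ordinary strings.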

In [65]:
# Check what the dataset looks like
name_Dataset.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0053137,tt0050419,tt0031983"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0075213,tt0037382,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0056404,tt0054452,tt0057345,tt0049189"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0080455,tt0078723,tt0072562,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0050976,tt0069467,tt0083922"


#### Checking information about the dataset


In [66]:
name_data_2=name_Dataset.copy()

In [67]:
# Check the shape (rows, columns)
name_data_2.shape

(13123690, 6)

In [68]:
name_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13123690 entries, 0 to 13123689
Data columns (total 6 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   nconst             object
 1   primaryName        object
 2   birthYear          object
 3   deathYear          object
 4   primaryProfession  object
 5   knownForTitles     object
dtypes: object(6)
memory usage: 600.8+ MB


The `info()` output above reports the number of records, the memory usage, and, most importantly, the data type of each column. All six columns are stored as `object`, including the two year columns.
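
Because `birthYear` and `deathYear` arrive as `object`, numeric checks on them are not possible until they are converted. A minimal sketch, assuming the only non-numeric entries are placeholders such as `\N`, uses `pd.to_numeric` with `errors="coerce"` and then the nullable `Int64` dtype so that missing years stay missing. The `years` frame is a hypothetical sample, not the notebook's data.

```python
import pandas as pd

# Hypothetical sample with the same string-typed year columns.
years = pd.DataFrame({
    "birthYear": ["1899", "1934", "\\N"],
    "deathYear": ["1987", "\\N", "\\N"],
})

for col in ["birthYear", "deathYear"]:
    # Unparseable entries (here the \N marker) become NaN ...
    numeric = pd.to_numeric(years[col], errors="coerce")
    # ... and the nullable integer dtype keeps whole years alongside <NA>.
    years[col] = numeric.astype("Int64")

print(years.dtypes)
```

Note that `errors="coerce"` silently turns any malformed value into a missing one, so it is worth counting the missing values before and after the conversion.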

In [76]:
col_name=name_data_2.columns
col_name

Index(['nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession',
       'knownForTitles'],
      dtype='object')

In [77]:
for col in col_name:
    print(col)

nconst
primaryName
birthYear
deathYear
primaryProfession
knownForTitles


In [87]:
name_data_2.shape[0]

13123690

In [91]:
# Count the unique values in each column (unique() counts NaN as one value)
col_unique_val_dict={}

list_col_name=[]
value_col=[]
for col in col_name:
    list_col_name.append(col)
    value_col.append(len(name_data_2[col].unique()))

col_unique_val_dict["Name"]=list_col_name
col_unique_val_dict["Unique_Val"]=value_col
print(col_unique_val_dict)

{'Name': ['nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles'], 'Unique_Val': [13123690, 10134299, 528, 470, 21645, 5577243]}


In [98]:
# Repeated values per column: total rows minus the number of unique values
no_duplicates=[]
for col in (col_name):
    no_duplicates.append(int(name_data_2.shape[0])-len(name_data_2[col].unique()))
col_unique_val_dict["Duplicates"]=no_duplicates

In [99]:
col_unique_val_dict

{'Name': ['nconst',
  'primaryName',
  'birthYear',
  'deathYear',
  'primaryProfession',
  'knownForTitles'],
 'Unique_Val': [13123690, 10134299, 528, 470, 21645, 5577243],
 'Duplicates': [0, 2989391, 13123162, 13123220, 13102045, 7546447]}

In [100]:
unique_value_dataFrame=pd.DataFrame(col_unique_val_dict)
unique_value_dataFrame

Unnamed: 0,Name,Unique_Val,Duplicates
0,nconst,13123690,0
1,primaryName,10134299,2989391
2,birthYear,528,13123162
3,deathYear,470,13123220
4,primaryProfession,21645,13102045
5,knownForTitles,5577243,7546447
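
The same summary table can be built without an explicit loop. The sketch below is one possible shortcut, shown on a hypothetical `demo` frame: `DataFrame.nunique(dropna=False)` counts NaN as a value, which matches `len(series.unique())` as used above. With the default `dropna=True` the counts would be one lower for columns that contain NaN, which is why `describe()` reports one fewer unique `primaryName` than the loop does.

```python
import pandas as pd

# Hypothetical sample: one missing name and a repeated placeholder.
demo = pd.DataFrame({
    "primaryName": ["Alex", "Alex", None, "Sam"],
    "birthYear": ["\\N", "\\N", "1950", "\\N"],
})

summary = pd.DataFrame({
    "Name": demo.columns,
    # dropna=False keeps NaN in the count, like len(series.unique()).
    "Unique_Val": demo.nunique(dropna=False).values,
})
# Repeated values per column: total rows minus unique values.
summary["Duplicates"] = len(demo) - summary["Unique_Val"]

print(summary)
```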


In [107]:
name_data_2.describe()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
count,13123690,13123683,13123690,13123690,10500647,13123690
unique,13123690,10134298,528,470,21644,5577243
top,nm0000001,Alex,\N,\N,actor,\N
freq,1,458,12522797,12900331,2369142,1463153


In [108]:
# Count the NaN values in each column
name_data_2.isna().sum()

nconst                     0
primaryName                7
birthYear                  0
deathYear                  0
primaryProfession    2623043
knownForTitles             0
dtype: int64
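
The zero counts for `birthYear`, `deathYear`, and `knownForTitles` are misleading: the `describe()` output above shows `\N` as the most frequent value in all three (12522797, 12900331, and 1463153 occurrences). `isna()` does not see these because `\N` is an ordinary string. A minimal sketch for counting the placeholder and converting it to NaN, using a hypothetical `demo` frame rather than the full data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample that uses \N as the missing-value marker.
demo = pd.DataFrame({
    "birthYear": ["1899", "\\N", "\\N"],
    "deathYear": ["1987", "\\N", "\\N"],
    "knownForTitles": ["tt0072308", "\\N", "tt0038355"],
})

# Count literal \N strings per column; isna() alone would report zero.
placeholder_counts = (demo == "\\N").sum()

# Replace the marker with NaN so the usual missing-data tools apply.
cleaned = demo.replace("\\N", np.nan)
nan_counts = cleaned.isna().sum()

print(placeholder_counts)
print(nan_counts)
```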

For now, three of the columns do not seem relevant to our recommendation system, so I am going to drop them:

- birthYear
- deathYear
- primaryProfession


In [113]:
# Dataset after dropping the irrelevant columns

Relevent_Data=name_data_2.drop(columns=["birthYear","deathYear","primaryProfession"])
Relevent_Data.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm0000001,Fred Astaire,"tt0072308,tt0053137,tt0050419,tt0031983"
1,nm0000002,Lauren Bacall,"tt0038355,tt0075213,tt0037382,tt0117057"
2,nm0000003,Brigitte Bardot,"tt0056404,tt0054452,tt0057345,tt0049189"
3,nm0000004,John Belushi,"tt0080455,tt0078723,tt0072562,tt0077975"
4,nm0000005,Ingmar Bergman,"tt0050986,tt0050976,tt0069467,tt0083922"


In [121]:
# Keep only the rows that have a primaryName
Relevent_Data=Relevent_Data[Relevent_Data.primaryName.notna()]

In [129]:
# Save the cleaned data set
Relevent_Data.to_csv("C:/Users/Mohammed/OneDrive/Desktop/Cleaned_Data_IMDB/Name/Name_Data.csv")
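
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, passing `index=False` avoids it. The sketch below shows the round trip on a hypothetical three-row frame and an in-memory buffer instead of the real output path.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned name data.
cleaned = pd.DataFrame({
    "nconst": ["nm0000001", "nm0000002", "nm0000003"],
    "primaryName": ["Fred Astaire", "Lauren Bacall", "Brigitte Bardot"],
    "knownForTitles": ["tt0072308", "tt0038355", "tt0056404"],
})

# Write without the index, then read the result back to verify it.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```

Reading the file back once after saving is a cheap check that the column set and row count survived the export.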