# Data Preprocessing and Cleaning

## Import Libraries

We start by importing the necessary libraries for our analysis. Here's what each library is used for:

- `pandas`: Used for data manipulation and analysis.
- `matplotlib.pyplot`: Provides a MATLAB-like plotting framework.
- `seaborn`: A statistical data visualization library based on matplotlib, used for creating attractive and informative statistical graphics.
- `MinMaxScaler`, `Normalizer`, `StandardScaler` from `sklearn.preprocessing`: These are tools for data preprocessing, specifically for scaling numerical features.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

## Loading the Dataset

In this section, we load the dataset from a CSV file named `tv_shows.csv`.
We use the `pd.read_csv()` function from the pandas library to read the CSV file and store the data in a DataFrame named `tv_data`.

In [None]:
tv_data = pd.read_csv('tv_shows.csv')

## Data Exploration

### Viewing the First Few Rows

We start by displaying the first few rows of the dataset using the `head()` method. This gives us a quick glimpse of what the data looks like and helps us understand its structure.



In [None]:
tv_data.head()

### Getting Data Columns

Here, we retrieve the column names of the dataset using the columns attribute. This provides a list of all the columns present in the dataset.

In [None]:
data_columns = tv_data.columns
print("Data columns",data_columns)

### Getting Data Information

We use the `info()` method to print a concise summary of the dataset, including the number of non-null values and the data type of each column. This helps us understand the completeness and structure of the dataset.

In [None]:
# info() prints its summary directly and returns None, so we call it without print()
tv_data.info()

### Getting Data Shape

The shape attribute gives us the dimensions of the dataset, i.e., the number of rows and columns. This helps us understand the size of the dataset.

In [None]:
data_shape = tv_data.shape
print("The data shape",data_shape)

### Descriptive Statistics

We use the describe() method to generate descriptive statistics of the numerical columns in the dataset, such as count, mean, standard deviation, minimum, and maximum values. This provides insights into the central tendency, dispersion, and shape of the distribution of numerical data.

In [None]:
tv_data.describe()

### Data Types

The dtypes attribute gives us the data types of each column in the dataset. Understanding the data types is crucial for data preprocessing and analysis, as it determines the operations we can perform on the data.

In [None]:
tv_data.dtypes

## Data Cleaning

In this section, we perform data cleaning tasks to ensure the quality and integrity of the dataset.


First, we drop the columns we do not need: in this case, the `ID` column and the `Unnamed: 0` column.

In [None]:
# Drop both unneeded columns in a single call
tv_data = tv_data.drop(columns=['Unnamed: 0', 'ID'])

print(tv_data.head())

Next, we check for duplicate rows in the dataset. Duplicate rows can skew analysis results and should be removed if found.

In [None]:
duplicate = tv_data.duplicated().sum()
print("Number of Duplicates: ", duplicate)
if duplicate > 0:
    print(tv_data[tv_data.duplicated()])
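
The cell above only reports duplicates. If any were found, removing them could look like the following minimal sketch. The `demo` DataFrame and its values are made up for illustration and stand in for `tv_data`.

```python
import pandas as pd

# Toy frame standing in for tv_data; the titles and years are made up.
demo = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show A"],
    "Year": [2010, 2015, 2010],
})

n_before = len(demo)

# keep="first" (the default) retains the first copy of each duplicated row.
deduped = demo.drop_duplicates().reset_index(drop=True)

n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicate row(s)")
```

Resetting the index afterwards is optional; it just keeps the row labels contiguous.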

We also examine the unique values in the `Year` and `Age` columns to see which values each column actually contains.

In [None]:
# Only the last expression in a cell is displayed, so we print both results
print(tv_data['Year'].unique())
print(tv_data['Age'].unique())
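
To count distinct values rather than list them, `nunique()` and `value_counts()` are the usual tools. A small sketch, assuming a made-up `demo` frame in place of `tv_data`:

```python
import pandas as pd

# Toy frame standing in for tv_data; the values are made up.
demo = pd.DataFrame({
    "Year": [2010, 2015, 2010, 2020],
    "Age": ["16+", "all", "16+", None],
})

# Number of distinct values per column (missing values are not counted).
unique_counts = demo.nunique()
print(unique_counts)

# How often each Age value occurs, including missing values.
age_counts = demo["Age"].value_counts(dropna=False)
print(age_counts)
```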

### Checking for Missing Values

We start by checking for any missing values in the dataset. Missing values can hinder analysis and modeling, so it's essential to identify and handle them appropriately.

In [None]:
missing_values = tv_data.isnull().sum()
print("The missing values in the dataset are",missing_values)

In [None]:
# Optional: drop the rows that have a missing value in the 'Age' column
# tv_data.dropna(subset=['Age'], inplace=True)
# print("Data after dropping NA values for Age", tv_data)

### Handling Missing Values

This function converts the IMDb rating from a string into a float. It splits the string on `/`, divides the first part by the second, and multiplies the result by ten. Any value that is not a string with exactly two parts becomes `None`, which pandas then treats as a missing value.

In [None]:
def convert_rating(rating):
    if isinstance(rating, str):  
        parts = rating.split('/')
        if len(parts) == 2:
            return float(parts[0]) / float(parts[1]) * 10
        else:
            return None
    else:
        return None
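
The same conversion can also be written without a row-by-row function. The sketch below is an alternative, not the method used in this notebook; the sample strings are made up. Unlike `convert_rating`, which would raise a `ValueError` on a two-part string whose parts are not numeric, this version turns such entries into NaN.

```python
import pandas as pd

# Made-up sample ratings, including a missing value and an unparsable string.
ratings = pd.Series(["9.4/10", "8.7/10", None, "n/a"])

# Split "score/maximum" into two columns.
parts = ratings.str.split("/", expand=True)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
score = pd.to_numeric(parts[0], errors="coerce")
maximum = pd.to_numeric(parts[1], errors="coerce")

# Rescale to a 0-10 range, as convert_rating does.
converted = score / maximum * 10
print(converted)
```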

Then we apply our conversion function to the `IMDb` column.

In [None]:
tv_data['IMDb'] = tv_data['IMDb'].apply(convert_rating)

print(tv_data)

To handle the missing values in the `IMDb` column, we replace them with the column mean. This keeps every row in the dataset, at the cost of pulling the imputed ratings toward the average.

In [None]:
# Calculate the mean IMDb rating excluding null values
mean_rating = tv_data['IMDb'].mean(skipna=True)

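# Replace null values with the mean IMDb rating.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
tv_data['IMDb'] = tv_data['IMDb'].fillna(mean_rating)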

print(tv_data)

We now verify that the `IMDb` column has no missing values.

In [None]:
null_values_after = tv_data.isnull().sum()
print("Null values after replacing with mean:\n", null_values_after)
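
Mean imputation is easy to check on a tiny example. The sketch below uses a made-up `demo` frame; the extra `IMDb_was_missing` column is a hypothetical addition, not part of this notebook, that records which ratings were imputed so they can be told apart later.

```python
import pandas as pd

# Toy frame with two missing ratings; the values are made up.
demo = pd.DataFrame({"IMDb": [9.0, None, 7.0, None]})

# mean() skips NaN by default, so this is (9.0 + 7.0) / 2.
mean_rating = demo["IMDb"].mean()

# Optional: remember which rows were imputed before overwriting them.
demo["IMDb_was_missing"] = demo["IMDb"].isna()

demo["IMDb"] = demo["IMDb"].fillna(mean_rating)
print(demo)
```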

## Data Transformation

In this section, we perform data transformation tasks to prepare the dataset for analysis.


We start by cleaning the `Age` column, which likely contains values with a trailing `+`, the label `all`, and missing values (NaN). We strip the `+`, map `all` to `1`, and replace NaN with `0` so that every entry is numeric.

We then convert the values from strings to integers and store the result in a new `Age_num` column.


In [None]:
# Strip '+' (regex=False treats it as a literal character), map 'all' to '1',
# and fill NaN with '0' (a string, so the column stays string-typed until the cast)
tv_data['Age_num'] = tv_data['Age'].str.replace('+', '', regex=False).replace('all', '1').fillna('0')

# Convert the 'Age_num' column to integers
tv_data['Age_num'] = tv_data['Age_num'].astype(int)


# Print the values of the 'Age_num' column
print(tv_data['Age_num'])
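
To see what each step of this transformation does, here is the same chain applied to a handful of made-up values. The `ages` Series is an assumption standing in for the real `Age` column.

```python
import pandas as pd

# Made-up age labels, including 'all' and a missing value.
ages = pd.Series(["16+", "7+", "all", None, "18+"])

age_num = (
    ages.str.replace("+", "", regex=False)  # "16+" -> "16"
        .replace("all", "1")                # exact match "all" -> "1"
        .fillna("0")                        # missing -> "0"
        .astype(int)                        # strings -> integers
)
print(age_num.tolist())
```

Filling with the string `"0"` rather than the integer `0` keeps the column a single type until the final cast.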

## Outlier Detection

In this section, we check for outliers in the dataset.

Outliers are data points that significantly differ from other observations in the dataset. They can arise due to various reasons, such as measurement errors, data entry mistakes, or genuine anomalies in the data. Outliers can skew statistical analyses and machine learning models, leading to misleading results if not properly handled.

### Checking for Outliers

We use visualizations and statistical techniques to identify outliers in the dataset. Two common choices are box plots, which highlight data points that fall outside the whiskers, and histograms, which show the overall distribution of a numerical variable.

A histogram provides a visual representation of the distribution of values in the `Age_num` column. It shows how often each age rating occurs.

In [None]:
plt.hist(tv_data['Age_num'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()
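
The whiskers of a box plot are conventionally drawn at 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. This is a minimal sketch on made-up values, not a result from `tv_data`; the 1.5 multiplier is the usual convention rather than something this notebook prescribes.

```python
import pandas as pd

# Made-up values; 60 is planted as an obvious outlier.
values = pd.Series([7, 7, 13, 16, 16, 16, 18, 18, 60])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are what a box plot draws beyond the whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```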

## Data Normalization


#### **1-Create Scalers:**

Think of a scaler as a tool that adjusts the values of our data to make them easier to work with. We create two scalers here: one for Min-Max scaling and another for feature normalization.


#### **2-Scale Data:**
Now, we want to change the values in our data to make them fit a specific range or pattern. For example, if we have ratings that go from 1 to 10, we might want to change them so they go from 0 to 1 instead. That's what scaling does.
With Min-Max scaling, we're adjusting our values to fit between 0 and 1.
With feature normalization, we're adjusting our values so that each row of data has a "length" of 1, which makes it easier to compare different rows.

#### **3-Add Scaled Features Back to DataFrame:**
We've now transformed our data, but we want to keep track of both the original and scaled values. So, we're adding two new columns to our data frame: 'IMDb_MinMax' and 'IMDb_Normalize'.
These new columns will hold the scaled values of our original 'IMDb' column.


In [None]:
# Create the scalers
scaler_minmax = MinMaxScaler()
scaler_normalize = Normalizer()

# Min-Max scaling: (x - min) / (max - min), which maps the column onto [0, 1]
data_minmax = scaler_minmax.fit_transform(tv_data[['IMDb']])

# Normalizer rescales each row (not each column) to unit L2 norm.
# With a single feature, every non-zero rating therefore becomes 1.0.
data_normalize = scaler_normalize.fit_transform(tv_data[['IMDb']])

# Add the scaled features back to the DataFrame (ravel() flattens the (n, 1) arrays)
tv_data['IMDb_MinMax'] = data_minmax.ravel()
tv_data['IMDb_Normalize'] = data_normalize.ravel()

print("Data with min-max scaling")
print(tv_data.head())

print("\n\nData with feature normalization")
print(tv_data.tail())
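
The difference between the scalers is easiest to see on a tiny array. The sketch below uses made-up numbers and also includes `StandardScaler`, which is imported at the top of this notebook. Note that `Normalizer` works row by row, so it only does something useful when a row has more than one feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# One feature, three made-up ratings.
ratings = np.array([[2.0], [5.0], [8.0]])

# (x - min) / (max - min), computed per column.
minmax = MinMaxScaler().fit_transform(ratings)

# (x - mean) / std, computed per column (population standard deviation).
standard = StandardScaler().fit_transform(ratings)

# x / ||row||, computed per row: a single-feature row always becomes 1.0.
unit_rows = Normalizer().fit_transform(ratings)

# With two features per row, the row [3, 4] has length 5 and becomes [0.6, 0.8].
two_features = Normalizer().fit_transform(np.array([[3.0, 4.0]]))

print(minmax.ravel())
print(standard.ravel())
print(unit_rows.ravel())
print(two_features)
```

For a single column such as `IMDb`, Min-Max scaling or standardization is usually the more informative choice.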



## Saving the Cleaned Data

In [None]:
tv_data.to_csv('tv_shows_scaled.csv', index=False)