<a href="https://colab.research.google.com/github/RifatMuhtasim/Data_Science_Workflow/blob/main/1.1.Data_Load_And_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load the **Dataset**
---

## Save Dataset on Colab

In [None]:
# For Zip File

import gdown
import os
import zipfile

# Replace 'output_path' with the path where you want to save the file
output_path = 'Robi_Datathon.zip'

if os.path.exists(output_path):
    print("File exists!")

else:
    print("File does not exist.")
    # Replace 'file_id' with the ID of your file in Google Drive
    file_id = '1fx0yBWwashiH2hODzEACgwacwqfd-Pq0'
    gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path, quiet=False)

    # Reuse the download path so the archive path always matches output_path
    zip_file_path = output_path

    # Extract the contents of the .zip file to the root directory
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall('/content/')

    # List the contents of the root directory
    extracted_files = !ls -a /content/
    print("Files extracted to root directory:", extracted_files)
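The extraction step can be tried without any download. The sketch below builds a tiny archive in a temporary directory and extracts it; the file names (`sample.zip`, `data/train.csv`) are made up for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "sample.zip")

    # Build a small archive so the example does not depend on a download
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data/train.csv", "id,value\n1,10\n")

    # Extract it, the same way the cell above extracts the downloaded file
    extract_dir = os.path.join(tmp, "extracted")
    with zipfile.ZipFile(zip_path, "r") as zf:
        names = zf.namelist()
        zf.extractall(extract_dir)

    extracted_ok = os.path.exists(os.path.join(extract_dir, "data", "train.csv"))

print(names, extracted_ok)
```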

In [None]:
# For Single File

import gdown
import os

# Replace 'output_path' with the path where you want to save the file
output_path = 'co2_emissions_train.csv'

if os.path.exists(output_path):
    print("File exists!")
else:
    print("File does not exist.")
    # Replace 'file_id' with the ID of your file in Google Drive
    file_id = '11zlCerdOfSBcTtFYHlz4DRgzjQeTGa8N'
    gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path, quiet=False)

In [None]:
# Load the dataset (run only the reader that matches your file format)
import pandas as pd

# From csv
df = pd.read_csv("/content/abcd.csv")

# From .tsv
df = pd.read_csv("abcd.tsv", delimiter="\t")

# From excel
df = pd.read_excel("abcd.xlsx", sheet_name='sheet1')

# From json
df = pd.read_json("abcd.json")
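To see the readers in action without files on disk, here is a minimal sketch that parses in-memory text. The sample rows are invented; `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for real files
csv_text = "id,city,temp\n1,Dhaka,31.5\n2,Sylhet,\n"
tsv_text = "id\tcity\ttemp\n1\tDhaka\t31.5\n2\tSylhet\t\n"

# read_csv accepts any file-like object, so StringIO behaves like a path
df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both parse to the same table; the empty field becomes NaN
print(df_csv.equals(df_tsv))
print(df_csv["temp"].isna().sum())
```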

In [None]:
# Check the dimensions of the dataset

df.shape
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

# Handle Missing Values
---

## Check for data types and missing values

In [None]:
# Inspect data types

df.info()

# or
df.dtypes

In [None]:
# Identify missing values in each column

missing_values = df.isna().sum()

# Determine the extent of missing values (as a percentage)
missing_percentage = (missing_values / len(df)) * 100

In [None]:
# Create a DataFrame showing each column's missing count and its percentage

total_missing_value = df.isna().sum().sort_values(ascending=False)
percent_missing_value = df.isna().mean().sort_values(ascending=False) * 100
missing_data = pd.concat([total_missing_value, percent_missing_value], axis="columns", keys=['Total', 'Percent'])
missing_data
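As a quick check of what such a summary looks like, here is a sketch on a toy frame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented): 4 rows with different amounts of missingness
toy = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["Dhaka", None, "Khulna", "Sylhet"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Count and percentage of missing values per column, worst column first
summary = pd.DataFrame({
    "Total": toy.isna().sum(),
    "Percent": toy.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(summary)
```

Here `age` is missing in 2 of 4 rows (50%), `city` in 1 of 4 (25%), and `score` has no gaps.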

### Determine the appropriate strategy for handling missing values
Based on the extent and nature of missing values, you can decide on strategies like:
1. Dropping rows or columns with missing values
2. Imputation (filling missing values with a specific value, e.g., mean, median, mode)
  - Univariate:
    - Numerical: Mean, Median, Mode, End of the Distribution
    - Categorical: Mode, 'Missing'
  - Multivariate:
    - KNN Impute
    - Iterative Impute
3. Using interpolation methods
4. Forward and Backward Technique
5. Depending on the context, keeping missing values as is


## # 1. Drop Columns and Rows

In [None]:
# Example of dropping rows with missing values
df.dropna(inplace=True)  # Drop rows with any missing values

# Example of dropping columns with missing value
df.dropna(axis=1, inplace=True)
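Dropping everything that has any missing value is often too aggressive. A minimal sketch of two narrower options, `subset` and `thresh`, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Drop rows only when a specific column is missing
rows_kept = toy.dropna(subset=["a"])

# Keep only columns that have at least 3 non-missing values
cols_kept = toy.dropna(axis="columns", thresh=3)

print(rows_kept.shape)          # (3, 3)
print(list(cols_kept.columns))  # ['a', 'c']
```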

## # 2. Imputation

### 2. Univariate (Numerical)

In [None]:
# Example of imputation using mean (numeric columns only)
df.fillna(df.mean(numeric_only=True), inplace=True)  # Replace missing values with the mean of each column

# Example of imputation using median (numeric columns only)
df.fillna(df.median(numeric_only=True), inplace=True)  # Replace missing values with the median of each column

# Example of imputation using mode
df.fillna(df.mode().iloc[0], inplace=True)  # Replace missing values with the mode of each column

# End of the distribution (here: the 5th percentile, i.e. the lower tail)
lower_tail_value = df['column_name'].quantile(0.05)
df['column_name'] = df['column_name'].fillna(lower_tail_value)  # Assign back instead of chained inplace
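The choice between mean and median matters when a column is skewed. A small sketch with one invented outlier shows how far apart the two fill values can be:

```python
import numpy as np
import pandas as pd

# Toy data (invented): one extreme value pulls the mean upward
toy = pd.DataFrame({"income": [20.0, 22.0, 24.0, np.nan, 1000.0]})

mean_filled = toy["income"].fillna(toy["income"].mean())
median_filled = toy["income"].fillna(toy["income"].median())

print(mean_filled[3])    # 266.5
print(median_filled[3])  # 23.0
```

The median stays close to the typical values, which is why it is usually the safer default for skewed numeric columns.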

### 2. Univariate (Categorical)

In [None]:
# Example of imputation using mode (for categorical data)
df.fillna(df.mode().iloc[0], inplace=True)  # Replace missing values with the mode of the column

# Example of filling missing values with a "Missing" tag
df['column_name'] = df['column_name'].fillna("Missing")
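The two categorical options behave differently: the mode inflates the most common category, while a `"Missing"` tag keeps the gaps visible as their own category. A sketch on made-up data:

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({"color": ["red", "blue", "red", None, None]})

mode_value = toy["color"].mode().iloc[0]
mode_filled = toy["color"].fillna(mode_value)
tag_filled = toy["color"].fillna("Missing")

print(mode_filled.value_counts().to_dict())
print(tag_filled.value_counts().to_dict())
```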

### 2. Multivariate

#### KNN Impute

In [None]:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)

features_dataset = dataset.drop(['label'], axis="columns")
label_dataset = dataset['label']
features_dataset.iloc[:, :] = knn_imputer.fit_transform(features_dataset)
dataset = pd.concat([features_dataset, label_dataset], axis="columns")


# Iterative Impute (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
iterative_imputer = IterativeImputer(max_iter=500)

features_dataset = dataset.drop(['label'], axis="columns")
label_dataset = dataset['label']
features_dataset.iloc[:, :] = iterative_imputer.fit_transform(features_dataset)
dataset = pd.concat([features_dataset, label_dataset], axis="columns")
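For the KNN imputer used above, a small worked sketch makes the mechanics concrete: each missing entry is replaced by the average of that feature over the nearest rows, where distance is computed on the features both rows have. The array below is illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small illustrative array with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[0, 2])  # 4.0 -> mean of 3.0 and 5.0 from the two nearest rows
print(X_filled[2, 0])  # 5.5 -> mean of 3.0 and 8.0 from the two nearest rows
```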

#### Iterative Imputation

In [None]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

def imputation(df):
    """Impute numeric columns with IterativeImputer and return a DataFrame."""
    imputer = IterativeImputer(missing_values=np.nan,
                               random_state=0,
                               n_nearest_features=3,
                               max_iter=1,
                               sample_posterior=True)

    df_imp = imputer.fit_transform(df)
    # Keep the original index so the concat below aligns row by row
    return pd.DataFrame(df_imp, columns=df.columns, index=df.index)

# y = train_df['Class']
# train_df.drop(columns='Class', inplace=True)
train_num = imputation(train_df.select_dtypes("number"))    # impute numeric columns only
train_cat = train_df.select_dtypes(["category", "object"])  # categorical columns pass through unchanged
train_df = pd.concat([train_cat, train_num], axis=1)
train_df.head()
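Iterative imputation models each column with missing values as a function of the other columns. A minimal sketch on invented data where `y` is exactly `2 * x`, so the hidden value should be recovered at roughly 6:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (invented): y = 2 * x, with one y value hidden
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
})

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(filled.loc[2, "y"])  # expected to land close to 6
```

The default estimator is a regularized linear model, so the result is approximate rather than exact.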

## # 3. Interpolation Technique

In [None]:
# Fill missing value with Linear Interpolate
new_df = df.interpolate(method="linear")

# Fill missing value with Time Interpolate
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
new_df = df.interpolate(method="time")
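The two methods differ only when observations are unevenly spaced. A sketch with made-up dates: `linear` treats rows as equally spaced, while `time` weights by the elapsed time in the index.

```python
import numpy as np
import pandas as pd

# Toy series (invented): the gap sits 1 day after the first point
# and 3 days before the last one
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

linear = s.interpolate(method="linear")  # ignores the index spacing
by_time = s.interpolate(method="time")   # uses the datetime index

print(linear.iloc[1])   # 2.0
print(by_time.iloc[1])  # 1.0
```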

## 4. Forward & Backward Technique

In [None]:
df_filled = df.ffill()  # Forward fill: carry the last valid value forward
df_filled = df.bfill()  # Backward fill: pull the next valid value backward
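Each direction leaves one edge uncovered: forward fill cannot fill a leading gap and backward fill cannot fill a trailing one. A sketch on a made-up series, including the common pattern of chaining both:

```python
import numpy as np
import pandas as pd

# Toy series (invented) with gaps at both ends and in the middle
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()       # leading NaN stays: nothing comes before it
backward = s.bfill()      # trailing NaN stays: nothing comes after it
both = s.ffill().bfill()  # chaining covers both edges

print(forward.tolist())
print(backward.tolist())
print(both.tolist())  # [1.0, 1.0, 1.0, 3.0, 3.0]
```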

## 5. Random Imputation

In [None]:
import numpy as np

# Sample replacement values at random from the observed (non-missing) values
non_missing_values = df['column_name'].dropna()
random_values = np.random.choice(non_missing_values, size=df['column_name'].isnull().sum())
df.loc[df['column_name'].isnull(), 'column_name'] = random_values
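A reproducible variant of the same idea, sketched with a seeded NumPy generator on made-up data. The seed value is arbitrary; the column name `height` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Toy data (invented)
toy = pd.DataFrame({"height": [150.0, 160.0, np.nan, 170.0, np.nan, 180.0]})

observed = toy["height"].dropna().to_numpy()
missing_mask = toy["height"].isna()

# Draw one observed value per missing entry (with replacement)
toy.loc[missing_mask, "height"] = rng.choice(observed, size=missing_mask.sum())

print(toy["height"].tolist())
```

Because every fill value is drawn from the observed data, the column's distribution is roughly preserved, unlike mean or median filling.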

## 6. Missing Indicator

In [None]:
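# Add missing-indicator columns alongside the imputed values
from sklearn.impute import SimpleImputer

si_imputer = SimpleImputer(add_indicator=True)
num_df = df.select_dtypes("number")  # the default mean strategy needs numeric input
# The output has extra indicator columns, so it cannot be assigned back into df
df_imputed = pd.DataFrame(si_imputer.fit_transform(num_df),
                          columns=si_imputer.get_feature_names_out(),
                          index=num_df.index)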
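With `add_indicator=True`, the transformed array gains one extra 0/1 column for each feature that had missing values during `fit`. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (invented): only column "a" has a gap
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
values = imputer.fit_transform(toy)

print(values.shape)  # (3, 3): a, b, and the indicator for a
print(values[1])     # imputed a = 2.0, b = 5.0, indicator = 1.0
```

The indicator lets a downstream model learn from the fact that a value was missing, not only from the filled value.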

It's important to understand the reasons behind missing data:

- Identifying the type of missing data: Is it Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR)?
- Evaluating the impact of missing data: Is the missingness causing bias or affecting the analysis?
- Choosing appropriate handling strategies: Different techniques are suitable for different types of missing data.

There are three main types of missing values:

- **Missing Completely at Random (MCAR):** MCAR is a specific type of missing data in which the probability of a data point being missing is entirely random and independent of any other variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the characteristics of the data point itself.
- **Missing at Random (MAR):** MAR is a type of missing data where the probability of a data point missing depends on the values of other variables in the dataset, but not on the missing variable itself. This means that the missingness mechanism is not entirely random, but it can be predicted based on the available information.
- **Missing Not at Random (MNAR):** MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. This means that the reason for the missing data is informative and directly associated with the variable that is missing.
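The three mechanisms can be simulated to see how each affects a naive average. This is a sketch on synthetic data; the variables (`age`, `income`), the probabilities, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: income rises with age, plus noise
age = rng.integers(20, 60, size=n)
income = 1000 + 50 * age + rng.normal(0, 200, size=n)
data = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 30% chance of being missing
mcar = data["income"].mask(rng.random(n) < 0.3)

# MAR: missingness depends on age (observed), not on income itself
mar = data["income"].mask(rng.random(n) < np.where(data["age"] > 40, 0.5, 0.1))

# MNAR: high incomes are more likely to be withheld
high = data["income"] > data["income"].median()
mnar = data["income"].mask(rng.random(n) < np.where(high, 0.5, 0.1))

print(round(data["income"].mean()), round(mcar.mean()),
      round(mar.mean()), round(mnar.mean()))
```

In this setup the MCAR average stays close to the true mean, while the MAR and MNAR averages are pulled downward. The MAR bias could in principle be corrected using `age`, since missingness depends only on that observed column; the MNAR bias cannot be corrected from the observed data alone.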


# Handle Duplicates
---

## # Identify Duplicates

In [None]:
# Count the rows that are exact duplicates of an earlier row

df.duplicated().sum()

## Drop Duplicates

In [None]:
df_no_duplicates = df.drop_duplicates()
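Duplicates are not always full-row copies. A sketch on made-up data showing `subset` (compare only selected columns) and `keep` (choose which copy survives):

```python
import pandas as pd

# Toy data (invented): rows 0 and 1 are identical; rows 2 and 3 share an id only
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "value": ["a", "a", "b", "c"],
})

full_dupes = toy.duplicated().sum()             # identical in every column
id_dupes = toy.duplicated(subset=["id"]).sum()  # same id, any value

# Keep the last record seen for each id
deduped = toy.drop_duplicates(subset=["id"], keep="last")

print(full_dupes, id_dupes)        # 1 2
print(deduped["value"].tolist())   # ['a', 'c']
```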