### Understanding Your Data

Before starting any machine learning project, it's essential to understand the structure and characteristics of your data. This step helps you identify key features, data types, missing values, correlations, and any potential issues such as duplicates or outliers. Understanding your data allows you to make informed decisions for preprocessing and model building.

#### Use Cases:
- **Shape & Overview**: Helps in getting a quick idea about the size of the dataset and the types of features it contains.
- **Missing Values**: Identifies if there are missing values that need to be handled.
- **Descriptive Statistics**: Summarizes basic statistical properties like mean, standard deviation, and percentiles.
- **Correlation**: Shows relationships between features, useful for feature selection and detecting multicollinearity.
- **Duplicates**: Identifies duplicate records that might need to be removed for better data quality.

#### Things to Keep in Mind:
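- **Missing Data**: Handle missing values appropriately (impute or remove), since many algorithms cannot process them and careless handling can bias results. If you impute, compute the fill values from the training data only to avoid data leakage.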
- **Outliers**: Investigate whether the dataset has outliers that need to be treated.
- **Data Types**: Ensure that your features have the correct data types (e.g., categorical, numerical) for further processing.
- **Duplicate Data**: Always check for duplicate rows to avoid skewing your results.
- **Correlations**: Highly correlated features lead to multicollinearity, which makes coefficient estimates in linear models unstable and harder to interpret.

---

By performing these checks, you gain a solid understanding of the data and can proceed with preprocessing and feature engineering confidently.


```python
# Syntax for Understanding Your Data
# (assumes `df` is an already-loaded pandas DataFrame)

# Get the shape of the dataset (rows, columns)
df.shape

# Display the first 5 rows of the dataset
df.head()

# Randomly sample 5 rows from the dataset
df.sample(5)

# Get concise information about data types and non-null values
df.info()

# Check for missing values in each column
df.isnull().sum()

# Summary statistics for numerical columns
df.describe()

# Check for duplicate rows
df.duplicated().sum()

# Display correlation matrix between numerical columns
# (numeric_only=True skips non-numeric columns; pandas >= 2.0 raises on them otherwise)
df.corr(numeric_only=True)

# Correlation of a specific column with others (e.g., 'target_column')
df.corr(numeric_only=True)['target_column']
```
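
The outlier check mentioned above can be sketched with the interquartile range (IQR) rule. This is a minimal illustration on a hypothetical toy column (`df_demo` and `income` are assumed names, not part of the dataset above), and the 1.5 × IQR threshold is a common convention rather than the only option.

```python
import pandas as pd

# Hypothetical toy data; one value is far away from the rest
df_demo = pd.DataFrame({"income": [30, 32, 35, 36, 38, 40, 41, 300]})

# Quartiles and interquartile range of the column
q1 = df_demo["income"].quantile(0.25)
q3 = df_demo["income"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df_demo[(df_demo["income"] < lower) | (df_demo["income"] > upper)]
print(outliers)
```

Flagged rows are candidates for investigation, not automatic removal; whether to cap, transform, or drop them depends on the domain.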


### Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations. EDA is a crucial step in understanding the distribution and relationships between variables, preparing the data for modeling, and identifying features that may require transformation.

#### Use Cases:
- **Data Type Identification**: Helps categorize features as numerical, categorical, or mixed.
- **Skewness**: Shows whether a distribution is skewed, which may affect model performance.
- **Univariate & Bivariate Analysis**: Examines single-variable distributions and relationships between pairs of variables.
- **Correlation and Interactions**: Detects multicollinearity and relationships between multiple variables using heatmaps, boxplots, and KDE (Kernel Density Estimation) plots.

#### Key Steps:

1. **Data Types (Numerical, Categorical, Mixed)**:
   - Identifying feature types helps guide which visualizations and transformations to apply.
   
2. **Skewness**:
   - Skewness checks are important for understanding the distribution shape of numerical data.
   - Skewed data may require transformations like log or power transformations.

3. **Univariate Analysis**:
   - Looks at the distribution of a single variable. Histograms or KDE plots are used for numerical data, and bar plots for categorical data.

4. **Bivariate Analysis**:
   - Examines relationships between two variables using scatterplots, boxplots, and 2D KDE plots for numerical data.

5. **2D KDE Plot**:
   - Visualizes the bivariate distribution of two numerical variables, providing insight into their joint density.

6. **Boxplot**:
   - Useful for understanding the distribution of numerical data and identifying outliers.

7. **Bar Plot**:
   - Displays the frequency of categorical variables.

8. **Heatmap**:
   - Useful for visualizing correlations between multiple numerical features, helping detect multicollinearity.

9. **Cross Tabulation (pd.crosstab)**:
   - Compares the frequency distribution between two categorical variables.

#### Things to Keep in Mind:
- **Skewness**: Highly skewed data may require transformations.
- **Outliers**: Boxplots help detect outliers, which may need special handling.
- **Multicollinearity**: Heatmaps help in detecting highly correlated features, which might need to be removed.
- **Mixed Variables**: Features that mix numerical and categorical information in a single column require special handling, such as splitting them into separate numerical and categorical parts or binning the continuous values.

---

Performing EDA helps you better understand the data, decide on feature engineering steps, and improve the quality of inputs for your machine learning models.


```python
# Syntax for EDA
# (assumes `df` is an already-loaded pandas DataFrame; column names are placeholders)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data Types (Numerical, Categorical)
numerical_columns = df.select_dtypes(include='number').columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

# Skewness
print(df[numerical_columns].skew())

# Univariate Analysis (Histogram for Numerical Columns)
df[numerical_columns].hist(figsize=(10, 8))
plt.show()

# Bar Plot for Categorical Data
df[categorical_columns[0]].value_counts().plot(kind='bar')
plt.show()

# Bivariate Analysis (2D KDE Plot); `fill` replaces the deprecated `shade` argument
sns.kdeplot(data=df, x='num_col1', y='num_col2', cmap='coolwarm', fill=True)
plt.show()

# Boxplot of a numerical column grouped by a categorical column
sns.boxplot(x='categorical_column', y='numerical_column', data=df)
plt.show()

# Heatmap of Correlations (numeric columns only)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Cross Tabulation
pd.crosstab(df['categorical_col1'], df['categorical_col2'])
```
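
Beyond reading a heatmap by eye, highly correlated pairs can be listed programmatically. The sketch below uses synthetic data and an assumed threshold of 0.9; the column names (`a`, `b`, `c`) and the threshold are illustrative choices, not values from the text above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

# Absolute correlations between numeric columns
corr = df_demo.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs above the (assumed) threshold; NaN entries never pass the comparison
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

For each flagged pair, one option is to keep the feature that is easier to interpret or more strongly related to the target.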


### What is Feature Engineering?

Feature Engineering is the process of using domain knowledge to extract or create new input features from raw data, improving the performance of machine learning models. It involves transforming data into formats that are better suited for algorithms, and it can greatly impact the accuracy and performance of the models.

#### Use Cases:
- **Feature Scaling**: Adjusts numerical features to a standard scale, which is crucial for algorithms sensitive to feature magnitude, such as distance-based algorithms (e.g., k-NN, SVM, etc.).
- **Feature Construction**: Involves creating new features by combining existing ones. This can capture important patterns that aren't explicitly present in the raw data.
- **Feature Selection**: Reduces the dimensionality of the dataset by selecting the most important features, which can help improve model performance and reduce overfitting.
- **Feature Extraction**: Derives new features from raw data through techniques like PCA, which can reduce redundancy in the data and highlight important patterns.

#### Key Steps:

1. **Feature Scaling**:
   - Ensures that numerical data is on the same scale, which helps machine learning models perform better.
   - Common techniques include:
     - **Standardization**: Rescales features to have zero mean and unit variance.
     - **Normalization**: Rescales features to a range of 0-1 using Min-Max scaling.

2. **Feature Construction**:
   - Combines or transforms existing features to create new ones, such as adding interaction terms, binning continuous data, or aggregating features.
   - Example: Constructing "total_expense" from "monthly_expense" and "number_of_months".

3. **Feature Selection**:
   - Identifies the most important features for the model, reducing overfitting and improving generalization.
   - Techniques include:
     - **Correlation-based**: Removes highly correlated features.
     - **Wrapper methods**: Evaluates feature subsets by training models.
     - **Embedded methods**: Uses algorithms like Lasso to perform feature selection.

4. **Feature Extraction**:
   - Reduces dimensionality by extracting features that encapsulate the most variance in the data. Principal Component Analysis (PCA) is commonly used here.

#### Things to Keep in Mind:
- **Scaling**: Distance-based and gradient-based models are sensitive to feature scale; tree-based models generally are not.
- **Construction**: Ensure that the constructed features are meaningful and capture the right relationships.
- **Selection**: Using too many features can lead to overfitting, while using too few can lead to underfitting.
- **Extraction**: Feature extraction can help in high-dimensional datasets where reducing the number of features is critical.

---

Proper feature engineering helps models learn from data more effectively, leading to improved predictions and performance.


```python
# Syntax for Feature Engineering
# (assumes `df` and `numerical_columns` from the EDA step; column names are placeholders)

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the target out of the feature set so it is never scaled or selected as an input
feature_columns = [col for col in numerical_columns if col != 'target']

# Feature Scaling (on real data, fit scalers on the training split only)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[feature_columns])

minmax_scaler = MinMaxScaler()
normalized_data = minmax_scaler.fit_transform(df[feature_columns])

# Feature Construction (Creating a new feature from existing ones)
df['total_expense'] = df['monthly_expense'] * df['number_of_months']

# Feature Selection (Select K Best based on ANOVA F-value; k must not exceed the feature count)
selector = SelectKBest(score_func=f_classif, k=5)
selected_features = selector.fit_transform(df[feature_columns], df['target'])

# Feature Extraction (PCA)
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
```
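
One way to keep the scaling and selection steps above free of data leakage is to wrap them in a `Pipeline` and fit it on the training split only. This is a minimal sketch on synthetic data generated with `make_classification`; the sample sizes, `k=5`, and all variable names are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data: 200 rows, 10 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scaling followed by univariate feature selection
preprocess = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),
])

# Statistics and selected features are learned from the training split only
X_train_prepared = preprocess.fit_transform(X_train, y_train)

# The test split is transformed with what was learned on the training split
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the test rows never influence the fitted means, variances, or F-scores, the evaluation on `X_test_prepared` stays honest.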


### Pandas Code Snippet

```python
# Each call below modifies `df` in place; pick the ones that fit your data
# rather than running them all in sequence.

# Drop rows with missing values directly in the original DataFrame
df.dropna(inplace=True)

# Alternative to dropping: fill missing values with 0 in the original DataFrame
df.fillna(0, inplace=True)

# Drop a specific column 'age' from the DataFrame
df.drop('age', axis=1, inplace=True)

# Sort the DataFrame by the 'salary' column
df.sort_values(by='salary', inplace=True)
```
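
When rows cannot simply be dropped or zero-filled, missing values can be imputed with a statistic learned from the training data. The following is a minimal sketch using scikit-learn's `SimpleImputer`; the `train` and `test` frames and the `salary` values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test data with gaps in the same column
train = pd.DataFrame({"salary": [40000.0, 50000.0, np.nan, 70000.0]})
test = pd.DataFrame({"salary": [np.nan, 90000.0]})

# The median is computed from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)

# The same training median is reused for the test rows
test_filled = imputer.transform(test)

print(imputer.statistics_)  # median learned from the training data
print(test_filled)
```

The median is less affected by extreme values than the mean, which is why it is a common default for skewed columns such as salaries.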


### Pandas Profiling

Pandas Profiling (now published as `ydata-profiling`) is a tool for automating Exploratory Data Analysis (EDA). It generates a comprehensive report of your dataset, summarizing key statistics and identifying potential data quality issues. It is especially useful for getting an overview of descriptive statistics, correlations, missing data, and data types without manually writing multiple code blocks. The report also includes visualizations of distributions, correlations, and interactions, saving time during initial data exploration.

**Why Use It?**  
- Provides a **quick EDA** summary of the dataset, offering detailed insights into data distributions, missing values, correlations, and feature statistics.
- Highlights data quality issues such as missing values, outliers, high cardinality, and duplicates.
- **Efficient for initial data exploration**, making it easier to understand the structure and issues of the dataset before feature engineering or model building.
  
**Key Features**:
- Overview of dataset, including total missing values, duplicates, and feature cardinality.
- Visual summaries of **univariate distributions**, correlations, and missing data patterns.
- Flags **warnings** for potential issues in the dataset, helping you detect outliers or problematic data.
  
**Keep in Mind**:
- For large datasets, generating a profiling report can be slow.
- Use this as an early step in data exploration to gain insights into data quality and potential areas for feature engineering or cleaning.
- The report embeds sample rows and per-column statistics, so treat it as sensitive when working with private data.

---


```python
# Install the package if needed. pandas-profiling has been renamed to ydata-profiling:
# pip install ydata-profiling

from ydata_profiling import ProfileReport

# Generate a profiling report for your DataFrame
profile = ProfileReport(df, title="Pandas Profiling Report")

# Display the report in a Jupyter Notebook
profile.to_notebook_iframe()

# Optionally, save the report to an HTML file for further analysis
profile.to_file("output_report.html")
```


### Feature Scaling - Standardization

**Standardization** (also known as Z-score normalization) is a scaling technique where each feature is transformed to have a **mean of 0** and a **standard deviation of 1**. Mean centering alone only subtracts the mean; standardization also divides by the standard deviation. This is especially important for algorithms that are sensitive to the scale of the data, so that no feature dominates training purely because of its units.

**Why Use It?**  
Some machine learning algorithms are sensitive to the scale of features, and it is important to normalize the range of all features to ensure they have comparable contributions to the model. Algorithms like **KMeans**, **K Nearest Neighbours (KNN)**, **PCA (Principal Component Analysis)**, and **Artificial Neural Networks** are particularly sensitive to the scales of the input data.

**Use Cases**:
- **KMeans Clustering**: Distance-based algorithms like KMeans require features to be on the same scale.
- **KNN (K-Nearest Neighbors)**: For distance calculations in KNN, all features should be scaled appropriately.
- **PCA**: Ensures that the principal components are influenced equally by all features.
- **Neural Networks**: Helps speed up training and avoids certain features dominating the learning process due to larger magnitude.

**Keep in Mind**:
- Standardization works well with roughly normally distributed features. It does not bound values to a fixed range, and outliers shift the mean and inflate the standard deviation.
- Scaling is unnecessary for models that are insensitive to feature scale (e.g., tree-based models). Use it for models that require features to be on the same scale.
- **Fit** the scaler on the training data only, then apply the same mean and standard deviation to the test data.
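
As a quick check of the fit-on-train rule, the z-score `z = (x - mean) / std` can be computed by hand and compared with `StandardScaler`. This is a minimal sketch on made-up numbers; the arrays below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Manual z-score using the training statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual_test = (X_test - mu) / sigma

print(scaler.transform(X_test))
print(manual_test)
```

Both printed values agree, which shows that the test point is scaled with the training mean and standard deviation rather than its own.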

---

```python
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Use the same transformation on test data
X_test_scaled = scaler.transform(X_test)
```


### Feature Scaling - Normalization | MinMaxScaling | MaxAbsScaling | RobustScaling

**Normalization** involves rescaling the features of your dataset to a specific range, typically [0, 1] or [-1, 1], depending on the scaling technique. This is essential for machine learning models that are sensitive to the magnitude of input features.

**Why Use It?**  
Many machine learning algorithms like **logistic regression**, **support vector machines (SVM)**, and **neural networks** can perform better when features are on the same scale. Different scaling techniques suit different types of data and tasks.

**Types of Scaling**:
- **MinMaxScaling**: Scales features to a given range, usually [0, 1]. Best suited for algorithms that assume the features are bounded within a specific range.
- **MaxAbsScaling**: Divides each feature by its maximum absolute value, mapping it into [-1, 1] without shifting it, so signs and zeros are preserved. Suited to data that is already centered at zero or is sparse.
- **RobustScaling**: Uses the median and interquartile range (IQR) for scaling, making it more robust to outliers compared to MinMaxScaling.

**Keep in Mind**:
- **Fit** the scaler on the **training data only** to avoid data leakage.
- **Transform** both training and test datasets with the same scaler to ensure consistency.
- MinMaxScaling can be sensitive to outliers, while RobustScaling handles them better.
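
The difference in outlier sensitivity can be seen on a tiny example. This is a minimal sketch with made-up values in which one point (100) is far from the rest; it fits each scaler on the same single column purely for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

# Four ordinary values and one outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X).ravel()
maxabs = MaxAbsScaler().fit_transform(X).ravel()
robust = RobustScaler().fit_transform(X).ravel()

print("MinMax:", minmax)  # the four ordinary values are squeezed close to 0
print("MaxAbs:", maxabs)  # each value divided by 100
print("Robust:", robust)  # centered on the median (3), divided by the IQR (2)
```

With MinMaxScaling and MaxAbsScaling the outlier defines the range, so the ordinary values end up nearly indistinguishable. RobustScaling keeps them spread between -1 and 0.5 and leaves the outlier visibly far away.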
  
---

```python
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

# Initialize the scalers
min_max_scaler = MinMaxScaler()
max_abs_scaler = MaxAbsScaler()
robust_scaler = RobustScaler()

# Fit on the training data only and transform both train and test data
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_test_minmax = min_max_scaler.transform(X_test)

X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_test_maxabs = max_abs_scaler.transform(X_test)

X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)
```


### Encoding Categorical Data | Ordinal Encoding | Label Encoding

**Categorical Encoding** is essential when dealing with features that contain categorical data, as most machine learning algorithms expect numerical input. Two common types of encoding are **Ordinal Encoding** and **Label Encoding**.

**Why Use It?**
Most machine learning algorithms cannot handle raw categorical data directly, so categories must be converted into numerical values that models can work with.

#### Types of Encoding:
- **Ordinal Encoding**: Used when the categorical variable has an inherent order (e.g., "low", "medium", "high"). Each category is assigned an integer based on its rank.
- **Label Encoding**: Assigns a unique integer to each category without regard to order. In scikit-learn, `LabelEncoder` is intended for the target variable (y) rather than for input features.

**Use Cases**:
- **Ordinal Encoding** is suitable for ordinal features with meaningful rankings.
- **Label Encoding** is suitable for encoding class labels in the target. Applied to nominal input features, the integers can mislead algorithms that treat them as ordered.

**Column Transformer**: This is used when you need to apply different preprocessing techniques (like encoding or scaling) to different columns of your dataset. It is especially useful for pipelines with both categorical and numerical features.

**Keep in Mind**:
- Ordinal encoding should only be applied when the order of categories matters.
- Label encoding can mislead some models if there is no inherent ordering but numbers are assigned. Consider **One-Hot Encoding** for such cases.
- Fit the encoder on the training data and transform both train and test datasets.
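
Fitting on the training data raises a practical question: what happens when the test data contains a category the encoder never saw? The sketch below shows one option, `OrdinalEncoder` with `handle_unknown="use_encoded_value"`; the sample values and the choice of `-1` as the placeholder code are assumptions for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; "very high" never appears in the training data
X_train = [["low"], ["medium"], ["high"], ["medium"]]
X_test = [["high"], ["very high"]]

encoder = OrdinalEncoder(
    categories=[["low", "medium", "high"]],  # explicit rank order
    handle_unknown="use_encoded_value",      # do not raise on unseen categories
    unknown_value=-1,                        # code assigned to unseen categories
)

X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

print(X_train_enc.ravel())  # low=0, medium=1, high=2
print(X_test_enc.ravel())   # the unseen category is mapped to -1
```

Without these two arguments the encoder raises an error on unseen categories, which is often the safer default while developing.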

---

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# Sample data
X = [['low'], ['medium'], ['high']]

# Ordinal Encoding (for ordinal features); the category order is given explicitly
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ordinal_encoded = ordinal_encoder.fit_transform(X)

# Label Encoding (scikit-learn's LabelEncoder is meant for the target y, not input features)
label_encoder = LabelEncoder()
y = ['cat', 'dog', 'mouse']
y_label_encoded = label_encoder.fit_transform(y)

# Column Transformer example; X_train is a small illustrative DataFrame with placeholder columns
X_train = pd.DataFrame({
    'categorical_column': ['red', 'green', 'red'],
    'numerical_column': [1.0, 2.5, 4.0],
})
column_transformer = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['categorical_column']),
    ('num', 'passthrough', ['numerical_column'])
])

# Fit on the training data and transform it
X_transformed = column_transformer.fit_transform(X_train)
```


### One Hot Encoding | Handling Categorical Data

**One-Hot Encoding** is a technique used to convert categorical data into a binary (0 or 1) matrix format. Each unique category is transformed into a separate column, where the presence of a category is marked as 1 and the absence as 0.

**Why Use It?**  
Unlike **Label Encoding**, which can introduce a false ordinal relationship between categories, One-Hot Encoding ensures that no such ordering is implied. It's particularly useful for nominal categorical data where categories have no order.

#### Multicollinearity Concern:
- **Multicollinearity** arises because the n columns created by one-hot encoding always sum to 1, so any one of them is perfectly predicted by the others. Dropping one column (leaving n-1) avoids this **Dummy Variable Trap**. The issue matters mainly for linear models.

**Key Points to Keep in Mind**:
- Use **One-Hot Encoding** when categories are **nominal** and there's no natural order.
- Check that the encoded columns are not linearly dependent, especially when using linear models.
- For a categorical feature with n categories, One-Hot Encoding creates **n columns** by default; dropping one leaves **n-1 columns** and avoids multicollinearity.
- To drop one category, use `drop_first=True` in `pd.get_dummies` or `drop='first'` in scikit-learn's `OneHotEncoder`.
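
For train/test workflows, scikit-learn's `OneHotEncoder` can be fitted on the training data and reused on new data. The sketch below uses `handle_unknown="ignore"` so that a category unseen during fitting is encoded as all zeros; the city values, including the unseen "Quetta", are illustrative assumptions.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature; "Quetta" does not appear in the training data
X_train = [["Karachi"], ["Lahore"], ["Islamabad"], ["Karachi"]]
X_test = [["Lahore"], ["Quetta"]]

# Unseen categories are encoded as a row of zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# The encoder returns a sparse matrix by default; convert it for display
X_test_enc = encoder.transform(X_test).toarray()

print(encoder.categories_[0])  # categories learned from the training data, in sorted order
print(X_test_enc)
```

This sketch keeps all n columns. Whether to also drop one column depends on the model, as discussed in the multicollinearity note above.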

---

```python
import pandas as pd

# Example dataset
df = pd.DataFrame({
    'City': ['Karachi', 'Lahore', 'Islamabad', 'Karachi']
})

# One-Hot Encoding using pandas
# drop_first=True drops the first category in sorted order ('Islamabad'),
# leaving n-1 columns and avoiding the Dummy Variable Trap.
# dtype=int gives 0/1 values instead of the True/False default in recent pandas.
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True, dtype=int)

# Output:
#    City_Karachi  City_Lahore
# 0             1            0
# 1             0            1
# 2             0            0
# 3             1            0
```


### Column Transformer in Machine Learning

The **Column Transformer** is a useful tool in machine learning when you need to apply different preprocessing techniques to different types of features (columns). For instance, you might want to apply **scaling** to numerical data while applying **encoding** to categorical data simultaneously.

**Why Use It?**  
It streamlines the preprocessing of datasets that contain both numerical and categorical variables, enabling you to handle each column type appropriately without manually applying transformations to each column.

#### Key Use Cases:
- **Numerical columns**: You may want to scale or normalize these features.
- **Categorical columns**: You can apply encoders such as **One-Hot Encoding** or **Ordinal Encoding**.
- **Pipelines**: The Column Transformer works well in conjunction with machine learning pipelines for streamlined workflows.

#### Important Points to Consider:
- Ensure the right transformation is applied to the correct column type.
- You can chain transformers together to process different columns efficiently.
- It's essential to fit the column transformer on training data and then apply it to both training and test sets to avoid **data leakage**.
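
The pipeline use case above can be sketched by placing a `ColumnTransformer` in front of a model and fitting the whole thing on the training split. This is a minimal illustration; the `bought` target column, the extra rows, and the choice of `LogisticRegression` are assumptions added for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with a binary target column "bought"
df_demo = pd.DataFrame({
    "age": [25, 32, 40, 50, 28, 45, 36, 52],
    "city": ["Karachi", "Lahore", "Islamabad", "Lahore",
             "Karachi", "Islamabad", "Lahore", "Karachi"],
    "salary": [50000, 60000, 65000, 70000, 52000, 68000, 61000, 72000],
    "bought": [0, 0, 1, 1, 0, 1, 0, 1],
})
X = df_demo.drop(columns="bought")
y = df_demo["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale numeric columns and one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model are fitted together, on the training split only
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions)
```

Calling `model.predict` applies the already-fitted scaler and encoder to the test rows automatically, so there is no separate transform step to forget.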

---

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Sample data
df = pd.DataFrame({
    'age': [25, 32, 40, 50],
    'city': ['Karachi', 'Lahore', 'Islamabad', 'Lahore'],
    'salary': [50000, 60000, 65000, 70000]
})

# Column Transformer setup
column_transformer = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'salary']),  # Scaling numerical data
    ('cat', OneHotEncoder(drop='first'), ['city'])  # One-Hot Encoding categorical data
])

# Fit and transform the data
X_transformed = column_transformer.fit_transform(df)

# The numerical columns are scaled, and the categorical column is one-hot encoded
```


### Function Transformer | Log Transform | Reciprocal Transform | Square Root Transform

**Function Transformers** allow you to apply mathematical transformations to your features. Some of the common transformations include:
- **Log Transform**: Compresses large values, which reduces right skew and the influence of large outliers. It requires strictly positive values (use log(1 + x) when zeros are present).
- **Reciprocal Transform**: Maps x to 1/x. It is stronger than the log transform and works well with extreme values, but it is undefined at zero and reverses the ordering of positive values.
- **Square Root Transform**: Less aggressive than log, used for moderately skewed, non-negative data.

These transformations are typically used to make non-normally distributed data more Gaussian (normal), which can help models that work best with roughly symmetric inputs or normally distributed errors (e.g., linear regression).

#### How to Check if Distribution is Normal?

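1. **sns.histplot(..., kde=True)**: Plots the distribution of your data. It replaces `sns.distplot()`, which has been deprecated since seaborn 0.11.
   - If the plot shows a bell-shaped curve, the data is approximately normally distributed.
   
2. **Series.skew()** / **DataFrame.skew()**: Returns the skewness of the data. 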
   - If the skew is close to 0, the data is symmetric. 
   - Positive values indicate right-skewed data, and negative values indicate left-skewed data.
   
3. **QQ Plot**: (Quantile-Quantile plot) checks the normality by comparing the quantiles of your data against a theoretical normal distribution.
   - If the points lie on the diagonal line, the data is normally distributed.

#### Key Points:
- **Log Transform**: Use for right-skewed data.
- **Reciprocal Transform**: Use for data with extreme values.
- **Square Root Transform**: Use for moderately skewed data.
- Checking the data's distribution is crucial before applying transformations.
- After transforming, always recheck the distribution.
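
The three transforms can be wrapped in scikit-learn's `FunctionTransformer` so they fit into a `Pipeline` or `ColumnTransformer`. The sketch below is a minimal illustration on made-up income values; the variable names are assumptions, and `np.log1p` (log of 1 + x) is used so that zeros would not break the transform.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up right-skewed values (floats, strictly positive)
X = np.array([[1000.0], [3000.0], [10000.0], [50000.0], [100000.0], [200000.0]])

# Log transform with an inverse, so predictions can be mapped back to the original scale
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

# Square root and reciprocal transforms
sqrt_tf = FunctionTransformer(func=np.sqrt)
recip_tf = FunctionTransformer(func=np.reciprocal)

X_log = log_tf.fit_transform(X)
X_sqrt = sqrt_tf.fit_transform(X)
X_recip = recip_tf.fit_transform(X)

# Round trip: the inverse of the log transform recovers the original values
X_back = log_tf.inverse_transform(X_log)

print(X_log.ravel())
print(X_sqrt.ravel())
print(X_recip.ravel())
```

Note that the reciprocal output is in reverse order relative to the input: the largest income becomes the smallest transformed value.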

---

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Sample data
df = pd.DataFrame({
    'income': [1000, 3000, 10000, 50000, 100000, 200000]
})

# Checking the distribution (histplot replaces the deprecated distplot)
sns.histplot(df['income'], kde=True)
plt.show()

# Checking Skewness
print(f"Skewness: {df['income'].skew()}")

# QQ Plot to check normality
stats.probplot(df['income'], dist="norm", plot=plt)
plt.show()

# Applying Log Transform to reduce skewness (log1p computes log(1 + x))
df['income_log'] = np.log1p(df['income'])

# Checking the transformed distribution
sns.histplot(df['income_log'], kde=True)
plt.show()

# Checking skewness after transformation
print(f"Skewness after log transform: {df['income_log'].skew()}")
```
