# Machine Learning Intro | Assignment

## Question 1: Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS).


* The differences between AI, ML, Deep Learning (DL), and Data Science (DS) are as follows:

*   **Artificial Intelligence (AI)**: This is the broadest concept. AI is the simulation of human intelligence in machines that are programmed to think and learn like humans. It's about creating intelligent agents that can reason, plan, perceive, learn, and act.

*   **Machine Learning (ML)**: ML is a subset of AI. It's a field of study that gives computers the ability to learn without being explicitly programmed. Instead of writing code for every possible scenario, ML algorithms learn from data and make predictions or decisions based on that learning.

*   **Deep Learning (DL)**: DL is a subset of ML. It is built on artificial neural networks, which are loosely inspired by the structure and function of the human brain. Deep learning models stack many layers of artificial neurons to process data and learn complex patterns. DL is particularly effective for tasks involving images, speech, and natural language.

*   **Data Science (DS)**: Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured. It encompasses various techniques and fields, including statistics, ML, data analysis, and domain expertise, to understand and interpret complex data.

**Think of it like this:**

*   **AI** is the overarching goal: making machines intelligent.
*   **ML** is one way to achieve AI: by enabling machines to learn from data.
*   **DL** is a specific technique within ML: using deep neural networks for complex pattern recognition.
*   **Data Science** is a field that *uses* AI and ML (among other techniques) to analyze data and extract valuable insights.

## Question 2: What are the types of machine learning? Describe each with one real-world example.

* There are three main types of machine learning:

1.  **Supervised Learning**: In supervised learning, the model is trained on a labeled dataset, meaning the data includes both the input features and the desired output. The goal is for the model to learn a mapping from the input to the output so it can predict the output for new, unseen data.
    *   **Real-world example**: **Spam email detection**. The model is trained on a dataset of emails labeled as "spam" or "not spam". It learns to identify patterns in the email content (like specific words or phrases) that indicate whether an email is spam and can then predict if a new email is spam or not.

2.  **Unsupervised Learning**: In unsupervised learning, the model is trained on an unlabeled dataset. The goal is to find hidden patterns or structures in the data without any predefined output.
    *   **Real-world example**: **Customer segmentation**. A company might use unsupervised learning to group its customers based on their purchasing behavior, demographics, or other attributes. This helps them understand different customer groups and tailor marketing strategies accordingly.

3.  **Reinforcement Learning**: In reinforcement learning, an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions, and its goal is to learn a policy that maximizes the cumulative reward over time.
    *   **Real-world example**: **Training a robot to walk**. The robot is the agent, and the environment is the physical world. The robot receives rewards for moving forward and penalties for falling. Through trial and error, the robot learns the sequence of movements (actions) that allow it to walk successfully.
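
The difference between the first two paradigms can be sketched in a few lines. This is only an illustration: the synthetic data from `make_blobs`, the cluster centers, and the choice of `LogisticRegression` and `KMeans` are assumptions made for this example, not part of the assignment. Reinforcement learning needs an interactive environment and is not shown.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 2-D data with two well-separated groups (illustrative values)
X, y = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

# Supervised: the labels y are used for training, then predictions are scored
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)
supervised_accuracy = clf.score(X_test, y_test)

# Unsupervised: the labels are never shown to the model; it only groups points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Supervised test accuracy:", supervised_accuracy)
print("Cluster sizes found without labels:", sorted(Counter(cluster_ids.tolist()).values()))
```

The classifier can be scored against known answers, while the clustering output has no "correct" labels to compare with; its groups have to be interpreted afterwards.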

## Question 3: Define overfitting, underfitting, and the bias-variance tradeoff in machine learning.


* Here are the definitions for overfitting, underfitting, and the bias-variance tradeoff:

*   **Overfitting**: Overfitting occurs when a machine learning model learns the training data too well, including the noise and outliers. This results in a model that performs very well on the training data but poorly on new, unseen data because it has essentially memorized the training examples instead of learning the underlying patterns. An overfitted model is too complex for the data.

*   **Underfitting**: Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the training data. The model doesn't learn the relationships between the features and the target variable effectively, resulting in poor performance on both the training data and new data. An underfitted model is not complex enough for the data.

*   **Bias-Variance Tradeoff**: This is a fundamental concept in machine learning. It refers to the conflict between minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training data:
    *   **Bias**: Bias is the error introduced by approximating a real-world problem, which may be complicated, by a simplified model. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
    *   **Variance**: Variance is the error introduced from the sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

*  The **bias-variance tradeoff** is the challenge of simultaneously minimizing both bias and variance. Decreasing bias often increases variance, and vice versa. The goal is to find a balance that minimizes the total error and allows the model to generalize well to new data.
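
A small experiment can make these definitions concrete. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the target function, noise level, sample sizes, and degrees are all assumptions chosen for illustration. A degree-1 fit is typically too simple (high bias), while a degree-12 fit on 20 points typically chases the noise (high variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_signal(x):
    """Underlying pattern that the models try to recover (assumed for the demo)."""
    return np.sin(2 * np.pi * x)


# Small noisy training set and a larger test set drawn from the same process
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_signal(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_signal(x_test) + rng.normal(0, 0.2, x_test.size)

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows, so it is the gap between training and test error that signals overfitting.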

## Question 4: What are outliers in a dataset, and list three common techniques for handling them

* Here's an explanation of outliers and common techniques for handling them:

*   **Outliers**: Outliers are data points that are significantly different from other observations in a dataset. They are extreme values that lie far away from the majority of the data. Outliers can occur due to various reasons, such as measurement errors, data entry mistakes, or genuinely rare events. They can negatively impact the performance of machine learning models by distorting statistical measures and model training.

*   **Common techniques for handling outliers**:

    1.  **Removal (Deletion)**: This is the simplest technique, where outlier data points are simply removed from the dataset. This is often done when the outlier is suspected to be due to an error and doesn't represent the true data distribution. However, removing outliers can lead to loss of valuable information and should be done cautiously, especially in small datasets.

    2.  **Transformation**: This involves applying mathematical transformations to the data to reduce the impact of outliers. Common transformations include:
        *   **Log transformation**: Useful for right-skewed data, as it compresses large values far more than small ones. It requires strictly positive values.
        *   **Square root transformation**: A milder alternative to the log transformation that also reduces the spread of the data. It requires non-negative values.
        *   **Box-Cox transformation**: A family of power transformations, defined for strictly positive data, that can be applied to make the data more normally distributed.

    3.  **Imputation**: Instead of removing outliers, you can replace them with a more representative value. This can be done using various imputation techniques, such as:
        *   **Mean, median, or mode imputation**: Replacing the outlier with the mean, median, or mode of the non-outlier data. The median is often preferred as it is less affected by extreme values.
        *   **Model-based imputation**: Using a machine learning model to predict the outlier value based on other features in the dataset.
        *   **Winsorizing**: Capping the outliers at a certain percentile (e.g., replacing values above the 95th percentile with the value at the 95th percentile).

* The choice of technique depends on the nature of the data, the suspected cause of the outliers, and the specific machine learning model being used. It's often a good practice to analyze the impact of different techniques on model performance.
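
The techniques above can be sketched with NumPy on a tiny array. The values are hypothetical, and outliers are detected with the common 1.5 × IQR rule, which is one possible choice. Capping here uses the IQR fences rather than fixed percentiles, a variant of the Winsorizing idea described above.

```python
import numpy as np

# Hypothetical transaction-like values with two extreme points
values = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 13.0, 250.0, 400.0])

# Detect outliers with the 1.5 * IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# 1. Removal: drop the flagged points
removed = values[~is_outlier]

# 2. Transformation: log1p compresses large values (inputs must be > -1)
log_transformed = np.log1p(values)

# 3a. Replacement: substitute the median of the non-outlier values
median_imputed = np.where(is_outlier, np.median(removed), values)

# 3b. Capping: clip values to the IQR fences
capped = np.clip(values, lower, upper)

print("Outliers flagged:", values[is_outlier])
print("After removal:", removed)
print("After median replacement:", median_imputed)
print("After capping:", capped)
```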

## Question 5: Explain the process of handling missing values and mention one imputation technique for numerical and one for categorical data.


* Here's an explanation of handling missing values and some imputation techniques:

* **Handling Missing Values**

* Missing values, often represented as NaN (Not a Number) or null, are common in real-world datasets and can significantly impact the performance of machine learning models. Handling missing values is a crucial step in data preprocessing. The process typically involves:

1.  **Identification**: Detecting where the missing values are located in the dataset. This can be done by checking for null or missing values in each column.
2.  **Analysis**: Understanding the pattern and extent of missing values. Are they random, or are they related to other features? The proportion of missing values can also influence the handling strategy.
3.  **Handling Strategy**: Deciding how to address the missing values. Common strategies include:
    *   **Deletion**: Removing rows or columns with missing values. This is suitable if the number of missing values is small and doesn't lead to significant data loss.
    *   **Imputation**: Replacing missing values with estimated values. This is often preferred when deletion would result in a substantial loss of data.
    *   **Ignoring**: Some machine learning algorithms can handle missing values internally, so you might choose to leave them as they are.

* **Imputation Techniques**

* Imputation is the process of replacing missing values with substitute values. The choice of imputation technique depends on the type of data (numerical or categorical) and the nature of the missingness.

*   **Imputation Technique for Numerical Data**:

    *   **Mean/Median Imputation**: Replacing missing numerical values with the mean or median of the non-missing values in that column. The median is often more robust to outliers.

*   **Imputation Technique for Categorical Data**:

    *   **Mode Imputation**: Replacing missing categorical values with the mode (most frequent category) of the non-missing values in that column.

* Other more advanced imputation techniques exist, such as using k-nearest neighbors (KNN) or model-based imputation, but mean/median and mode imputation are simple and commonly used methods.
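
A minimal pandas sketch of the identification step and the two simple imputation techniques follows. The DataFrame and its column names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in one numerical and one categorical column
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 90.0],
    "payment_method": ["Card", "Cash", None, "Card", "Card", None],
})

# Identification: count missing values per column
missing_counts = df_demo.isna().sum()
print(missing_counts)

# Numerical column: median imputation (robust to the extreme value 90)
age_median = df_demo["age"].median()

# Categorical column: mode imputation (most frequent category)
method_mode = df_demo["payment_method"].mode()[0]

df_imputed = df_demo.assign(
    age=df_demo["age"].fillna(age_median),
    payment_method=df_demo["payment_method"].fillna(method_mode),
)
print(df_imputed)
```

When a model is evaluated on held-out data, the median and mode should be computed from the training portion only and then reused for the test portion.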

## Question 6: Write a Python program that:

* Creates a synthetic imbalanced dataset with `make_classification()` from `sklearn.datasets`.
* Prints the class distribution.

(Include your Python code and output in the code box below.)

In [None]:
from collections import Counter
from sklearn.datasets import make_classification

# Create a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, # Total number of samples
                           n_features=20, # Number of features
                           n_informative=2, # Number of informative features
                           n_redundant=10, # Number of redundant features
                           n_clusters_per_class=1, # Number of clusters per class
                           weights=[0.9, 0.1], # The proportions of samples assigned to each class
                           flip_y=0.01, # The fraction of samples whose class is randomly assigned
                           class_sep=0.8, # The separation between classes
                           random_state=1) # Seed for reproducibility

# Print the class distribution
print("Class distribution:", Counter(y))

## Question 7: Implement one-hot encoding using pandas for the following list of colors: ['Red', 'Green', 'Blue', 'Green', 'Red']. Print the resulting dataframe.
(Include your Python code and output in the code box below.)

In [None]:
import pandas as pd

# List of colors
colors = ['Red', 'Green', 'Blue', 'Green', 'Red']

# Create a pandas Series from the list
colors_series = pd.Series(colors)

# Implement one-hot encoding
one_hot_encoded_df = pd.get_dummies(colors_series)

# Print the resulting dataframe
print(one_hot_encoded_df)

## Question 8: Write a Python script to:

* Generate 1000 samples from a normal distribution.
* Introduce 50 random missing values.
* Fill missing values with the column mean.
* Plot a histogram before and after imputation.

(Include your Python code and output in the code box below.)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate 1000 samples from a normal distribution
np.random.seed(42) # for reproducibility
data = np.random.normal(loc=0, scale=1, size=1000)
data_series = pd.Series(data)

# Introduce 50 random missing values
missing_indices = np.random.choice(len(data_series), size=50, replace=False)
data_with_missing = data_series.copy()
data_with_missing[missing_indices] = np.nan

# Plot histogram before imputation
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(data_with_missing.dropna(), bins=30, edgecolor='black')
plt.title('Histogram Before Imputation')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Fill missing values with the column mean
mean_value = data_with_missing.mean()
data_imputed = data_with_missing.fillna(mean_value)

# Plot histogram after imputation
plt.subplot(1, 2, 2)
plt.hist(data_imputed, bins=30, edgecolor='black')
plt.title('Histogram After Mean Imputation')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## Question 9: Implement Min-Max scaling on the following list of numbers [2, 5, 10, 15, 20] using `sklearn.preprocessing.MinMaxScaler`. Print the scaled array.
(Include your Python code and output in the code box below.)

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# List of numbers
data = np.array([2, 5, 10, 15, 20]).reshape(-1, 1) # Reshape for MinMaxScaler

# Implement Min-Max scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

# Print the scaled array
print(scaled_data)

## Question 10: You are working as a data scientist for a retail company. You receive a customer transaction dataset that contains:

* Missing ages,
* Outliers in transaction amount,
* A highly imbalanced target (fraud vs. non-fraud),
* Categorical variables like payment method.

Explain the step-by-step data preparation plan you’d follow before training a machine learning model. Include how you’d address missing data, outliers, imbalance, and encoding.

(Include your Python code and output in the code box below.)


# Task
Explain a step-by-step data preparation plan for a customer transaction dataset with missing ages, outliers in transaction amount, a highly imbalanced target (fraud vs. non-fraud), and categorical variables like payment method, before training a machine learning model. Include how to address missing data, outliers, imbalance, and encoding.

## Understand the data

### Subtask:
Load the dataset and get a general understanding of its structure, features, and the extent of the issues mentioned (missing values, outliers, imbalance).


**Reasoning**:
The first step is to load the dataset to understand its structure and identify the issues mentioned in the problem description. Since the dataset is not provided, I will create a synthetic dataset that simulates the characteristics described in the problem to demonstrate the data preparation steps. Because `np.random.randint` returns an integer array, which cannot hold NaN (a float), the 'Age' values are cast to float before the missing values are introduced.



In [None]:
import pandas as pd
import numpy as np

# Create a synthetic dataset simulating the described issues
np.random.seed(42) # for reproducibility

n_samples = 10000
# Simulate age with some missing values, initialize as float to allow NaN
age = np.random.randint(18, 90, size=n_samples).astype(float)
missing_age_indices = np.random.choice(n_samples, size=int(0.05 * n_samples), replace=False)
age[missing_age_indices] = np.nan

# Simulate transaction amount with some outliers
transaction_amount = np.random.normal(loc=50, scale=20, size=n_samples)
outlier_indices = np.random.choice(n_samples, size=int(0.01 * n_samples), replace=False)
transaction_amount[outlier_indices] = np.random.uniform(200, 1000, size=int(0.01 * n_samples)) # Introduce outliers

# Simulate payment method (categorical)
payment_method = np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Bank Transfer'], size=n_samples)

# Simulate a highly imbalanced target variable (fraud vs. non-fraud)
target = np.zeros(n_samples, dtype=int)
fraud_indices = np.random.choice(n_samples, size=int(0.02 * n_samples), replace=False)
target[fraud_indices] = 1

# Create DataFrame
df = pd.DataFrame({
    'Age': age,
    'Transaction_Amount': transaction_amount,
    'Payment_Method': payment_method,
    'Fraud': target
})

# Display the first few rows
print("First 5 rows of the dataset:")
display(df.head())

# Get info about the dataset
print("\nDataset Info:")
df.info()

# Get descriptive statistics of numerical columns
print("\nDescriptive Statistics:")
display(df.describe())

# Check the distribution of the target variable
print("\nTarget Variable Distribution:")
display(df['Fraud'].value_counts())

## Handle missing values

### Subtask:
Address the missing ages, likely using an imputation technique such as median imputation since age can have a skewed distribution.


**Reasoning**:
Calculate the median of the 'Age' column and fill the missing values with this median. Then, verify that the missing values have been filled.



In [None]:
# Calculate the median of the 'Age' column
median_age = df['Age'].median()
print(f"Median age: {median_age}")

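# Fill missing values in the 'Age' column with the median
# (assign the result back; calling fillna(..., inplace=True) on a column
# selection is chained assignment, which recent pandas versions deprecate)
df['Age'] = df['Age'].fillna(median_age)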

# Verify that there are no more missing values in the 'Age' column
print("\nMissing values in 'Age' after imputation:")
print(df['Age'].isnull().sum())

## Handle outliers

### Subtask:
Identify and handle outliers in the transaction amount. Techniques like Winsorizing or transformation could be considered, depending on the distribution of the transaction amounts.


**Reasoning**:
Calculate the IQR, define bounds, visualize the distribution, handle outliers using Winsorizing, and visualize again to verify.



In [None]:
# Calculate Q1, Q3, and IQR for 'Transaction_Amount'
Q1 = df['Transaction_Amount'].quantile(0.25)
Q3 = df['Transaction_Amount'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outlier detection (using 1.5 * IQR rule for reference)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR: {IQR}")
print(f"Lower bound (1.5*IQR rule): {lower_bound}")
print(f"Upper bound (1.5*IQR rule): {upper_bound}")

# Visualize the distribution of 'Transaction_Amount' before handling outliers
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.boxplot(df['Transaction_Amount'])
plt.title('Box Plot of Transaction Amount Before Winsorizing')
plt.ylabel('Transaction Amount')

plt.subplot(1, 2, 2)
plt.hist(df['Transaction_Amount'], bins=50, edgecolor='black')
plt.title('Histogram of Transaction Amount Before Winsorizing')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Implement Winsorizing by capping at the 99th percentile
upper_cap = df['Transaction_Amount'].quantile(0.99)
df['Transaction_Amount_Winsorized'] = np.clip(df['Transaction_Amount'], a_min=None, a_max=upper_cap)

print(f"\nUpper cap for Winsorizing (99th percentile): {upper_cap}")

# Verify that the outliers have been handled by visualizing the distribution after Winsorizing
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.boxplot(df['Transaction_Amount_Winsorized'])
plt.title('Box Plot of Transaction Amount After Winsorizing')
plt.ylabel('Transaction Amount')

plt.subplot(1, 2, 2)
plt.hist(df['Transaction_Amount_Winsorized'], bins=50, edgecolor='black')
plt.title('Histogram of Transaction Amount After Winsorizing')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## Encode categorical variables

### Subtask:
Convert categorical variables like payment method into a numerical format suitable for machine learning models, using techniques like one-hot encoding.


**Reasoning**:
Convert the 'Payment_Method' categorical column into numerical format using one-hot encoding and update the dataframe.



In [None]:
# Convert 'Payment_Method' into one-hot encoded columns
payment_method_encoded = pd.get_dummies(df['Payment_Method'], prefix='Payment_Method')

# Concatenate the new one-hot encoded columns with the original DataFrame and drop the original 'Payment_Method' column
df = pd.concat([df.drop('Payment_Method', axis=1), payment_method_encoded], axis=1)

# Display the first few rows of the updated DataFrame
print("First 5 rows of the DataFrame after one-hot encoding:")
display(df.head())
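
One design choice worth noting: with full one-hot encoding, each dummy column is implied by the others, which matters for linear models that include an intercept. A minimal sketch of the alternative, using a small hypothetical series rather than the dataset above, is to drop one level with `drop_first=True`:

```python
import pandas as pd

# Hypothetical payment methods (not the synthetic dataset above)
methods = pd.Series(
    ["Credit Card", "PayPal", "Debit Card", "PayPal"], name="Payment_Method"
)

# Full one-hot encoding: one column per category, exactly one 1 per row
full = pd.get_dummies(methods, prefix="Payment_Method", dtype=int)

# Reduced encoding: the first category (alphabetically) becomes the baseline
reduced = pd.get_dummies(methods, prefix="Payment_Method", drop_first=True, dtype=int)

print(full)
print(reduced)
```

Tree-based models are generally unaffected by the redundant column, so the full encoding used above is a reasonable default there.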

## Address data imbalance

### Subtask:
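Handle the highly imbalanced target variable (fraud vs. non-fraud) using techniques like oversampling the minority class (e.g., SMOTE) or undersampling the majority class. In a real project, resampling should be applied to the training split only, so that the test set keeps the original class distribution; here it is applied to the full synthetic dataset for demonstration.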


**Reasoning**:
Separate features and target, then apply SMOTE to handle the class imbalance and print the resampled class distribution.



In [None]:
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df.drop('Fraud', axis=1)
y = df['Fraud']

# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the dataset
X_resampled, y_resampled = smote.fit_resample(X, y)

# Print the class distribution of the resampled target variable
print("Class distribution after SMOTE:", Counter(y_resampled))

## Feature scaling

### Subtask:
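Apply feature scaling (e.g., StandardScaler or MinMaxScaler) to numerical features, especially if the chosen machine learning model is sensitive to the scale of features. Ideally the scaler is fitted on the training data only and then applied unchanged to the test data.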


**Reasoning**:
Identify numerical features, apply StandardScaler, and replace original numerical columns with scaled data.



In [None]:
from sklearn.preprocessing import StandardScaler

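# Identify numerical columns (excluding one-hot encoded columns)
# Note: X_resampled still holds both the raw 'Transaction_Amount' and its
# winsorized copy; for modeling, keep only one (typically the winsorized one)
numerical_cols = ['Age', 'Transaction_Amount', 'Transaction_Amount_Winsorized']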

# Instantiate StandardScaler
scaler = StandardScaler()

# Apply StandardScaler to the numerical columns of X_resampled
X_resampled[numerical_cols] = scaler.fit_transform(X_resampled[numerical_cols])

# Display the first few rows of the scaled X_resampled DataFrame
print("First 5 rows of the scaled X_resampled DataFrame:")
display(X_resampled.head())

## Split data

### Subtask:
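Split the prepared data into training and testing sets to evaluate the model's performance on unseen data. Note that because SMOTE and scaling were applied before this split, the test set contains synthetic samples and is not a fully independent hold-out; splitting first is the safer order in practice.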


**Reasoning**:
Split the resampled data into training and testing sets and print their shapes.



In [None]:
from sklearn.model_selection import train_test_split

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
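
The steps above apply SMOTE and scaling before the split. A leakage-free ordering can be sketched as follows, under several assumptions: the data is a small hypothetical stand-in that mirrors the column names used above, and plain random oversampling with `sklearn.utils.resample` stands in for SMOTE so that the example needs only scikit-learn. The point is the order of operations, not the specific resampler.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(42)
n = 2000

# Hypothetical stand-in for the transaction data
demo = pd.DataFrame({
    "Age": rng.integers(18, 90, n).astype(float),
    "Transaction_Amount": rng.normal(50, 20, n),
    "Fraud": (rng.random(n) < 0.05).astype(int),
})
demo.loc[rng.choice(n, 100, replace=False), "Age"] = np.nan

X = demo.drop(columns="Fraud")
y = demo["Fraud"]

# 1. Split first, stratifying so both parts keep the original fraud rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Learn preprocessing statistics from the training split only
age_median = X_train["Age"].median()
amount_cap = X_train["Transaction_Amount"].quantile(0.99)


def prepare(frame):
    """Impute and cap using statistics learned on the training split."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Transaction_Amount"] = out["Transaction_Amount"].clip(upper=amount_cap)
    return out


X_train, X_test = prepare(X_train), prepare(X_test)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

# 3. Oversample the minority class in the training split only
n_majority = int((y_train == 0).sum())
minority_upsampled = resample(
    X_train_scaled[y_train == 1], replace=True,
    n_samples=n_majority, random_state=42,
)
X_train_bal = pd.concat([X_train_scaled[y_train == 0], minority_upsampled])
y_train_bal = pd.Series([0] * n_majority + [1] * n_majority, index=X_train_bal.index)

print("Balanced training classes:", y_train_bal.value_counts().to_dict())
print("Test fraud rate (left untouched):", round(float(y_test.mean()), 3))
```

With this ordering the test set contains only real, unresampled rows, so metrics computed on it reflect the class balance the model would face in use.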

## Summary:

### Data Analysis Key Findings

*   The initial synthetic dataset successfully simulated the described issues, including approximately 5% missing values in 'Age', potential outliers in 'Transaction\_Amount' (maximum value significantly higher than the 75th percentile), and a highly imbalanced target variable where only 2% of samples were labeled as fraud.
*   Median imputation was successfully applied to address the missing values in the 'Age' column, resulting in no remaining missing values in this column.
*   Winsorizing at the 99th percentile was effective in handling outliers in the 'Transaction\_Amount' column, significantly reducing the range of values compared to the original distribution.
*   One-hot encoding was successfully applied to the 'Payment\_Method' categorical variable, creating new binary columns for each payment method and removing the original categorical column.
*   The SMOTE oversampling technique effectively balanced the target variable distribution, resulting in an equal number of samples for both fraud and non-fraud classes (9800 each).
*   Numerical features ('Age', 'Transaction\_Amount', and 'Transaction\_Amount\_Winsorized') were successfully scaled using `StandardScaler`, transforming their values to have a mean of 0 and a standard deviation of 1.
*   The prepared data was successfully split into training and testing sets, with 80% allocated for training and 20% for testing.

### Insights or Next Steps

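*   The data is now prepared for training a machine learning model to detect fraudulent transactions, and various classification algorithms can be applied to the balanced, scaled features. One caveat: because SMOTE and scaling were applied before the train/test split, the test set contains synthetic samples, so scores measured on it are likely to be optimistic. In a real project the split should come first.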
*   Further model evaluation should focus on appropriate metrics for imbalanced datasets, such as precision, recall, F1-score, and the Area Under the Receiver Operating Characteristic curve (AUC-ROC), rather than just accuracy.
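
To see why accuracy alone is misleading here, consider a toy case with hypothetical labels: a "model" that always predicts non-fraud on a test set with 5% fraud. The labels and the constant predictor are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical test labels: 95 legitimate transactions and 5 frauds
y_true = np.array([0] * 95 + [1] * 5)

# A useless "model" that always predicts non-fraud
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids warnings when no positive predictions exist
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
# A constant score cannot rank frauds above non-frauds
auc = roc_auc_score(y_true, np.zeros(len(y_true)))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```

Accuracy looks high even though not a single fraud is caught, which is exactly what recall, F1, and AUC expose.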
