# Machine Learning Intro Assignment

## Question 1
**Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS).**

Artificial Intelligence (AI) is a broad field of computer science focused on building systems capable of performing tasks that typically require human intelligence, such as reasoning, planning, perception, and problem-solving. AI encompasses a wide range of technologies, from rule-based systems to advanced learning models.

Machine Learning (ML) is a subfield of AI concerned with developing algorithms that enable computers to learn patterns and make decisions based on data, improving automatically with experience. Instead of relying on hand-written rules for each task, ML fits models to data, using approaches such as regression, classification, and clustering.

Deep Learning (DL) is a subset of ML that uses multi-layered artificial neural networks to model complex relationships and extract features from large volumes of data. DL is especially powerful for tasks involving images, speech, and text, using hierarchical learning to capture intricate patterns (e.g., image recognition, language translation).

Data Science (DS) is an interdisciplinary field that combines statistics, programming, domain expertise, and data analysis to extract valuable insights and knowledge from structured and unstructured data. DS includes all aspects from data cleaning and preparation to statistical modeling and interpretation, leveraging AI, ML, and DL as tools within its methodology.

## Question 2
**What are the types of machine learning? Describe each with one real-world example.**

**Supervised Learning** involves training a model on labeled data, where each input has a known output. Example: Spam email detection, where emails are labeled as ‘spam’ or ‘not spam’, and the model learns to classify future emails based on these labels.
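
As a rough illustration of the spam example, the sketch below trains a Naive Bayes classifier on four hand-labeled emails. The emails, the labels, and the choice of `CountVectorizer` with `MultinomialNB` are assumptions made for this sketch, not a recommended production setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data (invented for illustration): 1 = spam, 0 = not spam.
emails = [
    "win money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "project report attached",
]
labels = [1, 1, 0, 0]

# Turn each email into word counts, then fit a Naive Bayes classifier.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)
clf = MultinomialNB().fit(X_train, labels)

# Classify two unseen emails using the vocabulary learned above.
new_emails = ["claim your free money", "meeting tomorrow at noon"]
predictions = clf.predict(vectorizer.transform(new_emails))
print(dict(zip(new_emails, predictions)))
```

The model only sees the labels during `fit`; at prediction time it assigns a label based on which class the words were associated with in training.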

**Unsupervised Learning** works with unlabeled data, identifying patterns or groupings without predefined outputs. Example: Customer segmentation uses clustering algorithms to divide customers into groups based on behavior, despite not knowing group labels beforehand.
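
A minimal sketch of customer segmentation, assuming two invented features (annual spend and visits per month) and K-Means with two clusters. No labels are passed to the algorithm; it only groups similar rows.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer data: [annual spend, visits per month].
customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low spend, infrequent
    [1500, 8], [1600, 10], [1550, 9],  # high spend, frequent
])

# No labels are given; KMeans groups the rows by similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up in the same group. On real data the features would normally be scaled first so that spend does not dominate the distance.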

**Semi-supervised Learning** uses a combination of labeled and unlabeled data, useful when labeling data is expensive or time-consuming. Example: Photo categorization, where only some images are labeled but the system predicts labels for the unlabeled images after learning patterns from the labeled portion.

**Reinforcement Learning** trains agents to make sequences of decisions using feedback in the form of rewards or penalties. Example: Game-playing AI agents (like AlphaGo) that learn optimal moves over time by interacting with the game environment and receiving scores for actions.
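
The reward-feedback loop can be sketched with a two-action bandit, a simplified setting with no game state (far simpler than AlphaGo). The payout probabilities, the epsilon value, and the step count are made-up assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
reward_probs = [0.2, 0.8]        # hidden payout probability of each action (made up)
estimates = np.zeros(2)          # agent's running estimate of each action's value
counts = np.zeros(2, dtype=int)  # how often each action has been tried
epsilon = 0.1                    # fraction of steps spent exploring

for _ in range(2000):
    # Explore a random action occasionally; otherwise exploit the best estimate.
    if rng.random() < epsilon:
        action = int(rng.integers(2))
    else:
        action = int(np.argmax(estimates))
    reward = float(rng.random() < reward_probs[action])  # 1.0 or 0.0
    counts[action] += 1
    # Incremental mean update of the chosen action's estimate.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(best_action, estimates.round(2))
```

The agent is never told which action is better; it infers this from the rewards it receives, which is the core idea behind the game-playing example.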

## Question 3
**Define overfitting, underfitting, and the bias-variance tradeoff in machine learning.**

Overfitting occurs when a model learns the training data—including its noise—to such an extent that it performs excellently on the training set but poorly on new, unseen data. Typically, overfitted models are too complex relative to the available data and fail to generalize.

Underfitting happens when a model is too simplistic and fails to capture significant patterns in the training data, leading to poor performance on both the training and test sets. Underfitting usually arises from models with too few parameters or overly strong assumptions about data structure.

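The bias-variance tradeoff describes how a model's expected error on unseen data splits into bias, variance, and irreducible noise, and how model complexity trades one against the other. High bias means the model makes overly strong assumptions, leading to underfitting, while high variance means the model is too flexible and sensitive to noise in the training set, resulting in overfitting. Reducing one typically increases the other, so both cannot be minimized at once; the best models sit at a “sweet spot” of complexity where their combined contribution to the error is lowest, which gives good generalization.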
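
One way to see these effects is to fit polynomials of increasing degree to noisy data and compare training and test error. The sketch below assumes a synthetic sine curve with Gaussian noise; the degrees 1, 3, and 10 are arbitrary choices standing in for a too-simple, a reasonable, and a very flexible model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus noise (all values are made up).
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for degree in (1, 3, 10):
    train_err, test_err = mse(degree)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Training error keeps falling as the degree grows, while the straight line (degree 1) has high error on both sets, which is the underfitting pattern. A very flexible fit will often show a widening gap between training and test error, though how pronounced that is depends on the particular noise sample.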



## Question 4
**What are outliers in a dataset, and list three common techniques for handling them.**

Outliers are data points that significantly deviate from the majority of values in a dataset. They may arise due to errors in data collection, natural variability, or genuine rare events. Outliers can skew statistical measures and impact model accuracy.

Common techniques for handling outliers include:

- **Removal:** Identify and delete outliers using statistical thresholds, such as values lying more than 3 standard deviations from the mean, or values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR is the interquartile range (Q3 − Q1).

- **Transformation:** Apply mathematical transformations (log, square root) to data to reduce the impact of outliers.

- **Imputation:** Replace outliers with statistical measures like mean, median, or with predicted values from models trained on non-outlier data.
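
The three techniques can be sketched on a small made-up series, using the 1.5 × IQR rule to flag outliers. The data values and the choice of the median as the replacement value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values with one obvious outlier (500).
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 500])

# Flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# 1. Removal: keep only the non-outlier rows.
removed = s[~is_outlier]
# 2. Transformation: log(1 + x) compresses large values.
log_scaled = np.log1p(s)
# 3. Imputation: replace outliers with the median of the remaining values.
imputed = s.where(~is_outlier, s[~is_outlier].median())

print(removed.tolist())
print(imputed.tolist())
```

Which option fits best depends on why the outlier exists: removal suits data-entry errors, while transformation keeps genuine rare events in the dataset with reduced influence.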



## Question 5
**Explain the process of handling missing values and mention one imputation technique for numerical and one for categorical data.**

Handling missing values starts with identifying missing entries (nulls, NaN, or invalid data), followed by understanding their cause (random or systematic). Next, the data analyst chooses a treatment method, including removal (dropping rows/columns), imputation (filling missing values), or prediction (using models to estimate missing values).

For numerical data, one common imputation method is mean imputation: replacing missing values with the average of the available values in the column. For categorical data, mode imputation is frequently used: filling missing entries with the most frequently occurring value in the column.
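
A minimal pandas sketch of both techniques, assuming a made-up table with one numerical column (`age`) and one categorical column (`city`):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0],
    "city": ["Pune", "Delhi", "Pune", None],
})

# Identify missing entries: count them per column.
print(df.isna().sum())

# Numerical column: fill with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```

Assigning the result back to the column (rather than calling `fillna` with `inplace=True` on a selected column) avoids the chained-assignment warning in recent pandas versions.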



## Question 6
**Python: Create a synthetic imbalanced dataset and print class distribution.**

```python
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9,0.1], n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, n_samples=1000, random_state=42)
unique, counts = np.unique(y, return_counts=True)
print('Class distribution:', dict(zip(unique, counts)))
```

Output:
-----------------------------------------------------------------------------------------------
```py
Class distribution: {np.int64(0): np.int64(898), np.int64(1): np.int64(102)}
```

## Question 7
**Python: One-hot encode a list of colors and print the DataFrame.**

```python
import pandas as pd

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']
df = pd.DataFrame({'Color': colors})
encoded = pd.get_dummies(df, columns=['Color'])
print(encoded)
```
Output:
---------------------------------------------------------------------------------------
```python
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True
```

## Question 8
**Python: Simulate normal distribution, introduce missing values, mean-fill, plot histograms.**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

samples = np.random.normal(loc=0, scale=1, size=1000)
df = pd.DataFrame(samples, columns=['Sample'])
missing_idx = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_idx, 'Sample'] = np.nan

plt.hist(df['Sample'].dropna(), bins=30, alpha=0.5, label='Before Imputation')
# Assign the filled column back; chained inplace fillna triggers a FutureWarning in pandas.
df['Sample'] = df['Sample'].fillna(df['Sample'].mean())
plt.hist(df['Sample'], bins=30, alpha=0.5, label='After Imputation')
plt.legend()
plt.show()
```

Output:
---------------------------------------------------------------------------------------
```python
<Figure size 640x480 with 1 Axes>
```

## Question 9
**Python: Min-Max scaling for given numbers.**

```py
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[2], [5], [10], [15], [20]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print('Scaled data:', scaled.flatten())
```
Output:
-------------------------------------------------------------
```py
Scaled data: [0.         0.16666667 0.44444444 0.72222222 1.        ]
```

## Question 10
**Data Preparation Plan for Retail Fraud Detection**

As a data scientist, preparing this retail customer transaction data involves:

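- **Step 1: Data Cleaning**: Remove duplicate transactions and entries irrelevant to fraud detection.
- **Step 2: Handle Missing Data**: For missing ages, use mean imputation (numerical); mark implausible extreme ages as `unknown` (missing) so they are imputed too rather than kept as real values.
- **Step 3: Outlier Treatment**: Detect outliers with the Z-score or IQR rule, then remove them or cap them with Winsorization.
- **Step 4: Imbalance Handling**: Use oversampling of the minority class (e.g., SMOTE), downsampling of the majority class, or cost-sensitive learning.
- **Step 5: Encode Categoricals**: One-hot encode the payment method.
- **Step 6: Feature Engineering**: Create new features and Min-Max scale the numerical columns.
- **Step 7: Model Step**: Train/test split. Imputation statistics, scaling, and SMOTE should be fit on the training portion only, so that no information from the test set leaks into preparation.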

Example Python template:

```python
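import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
# Template: 'transactions.csv' and the 'IsFraud' target column are placeholder names.
df = pd.read_csv('transactions.csv')
# Assign the result back; chained inplace fillna is deprecated in pandas.
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Keep amounts between the 5th and 95th percentiles.
low, high = df['TransactionAmount'].quantile([0.05, 0.95])
df = df[df['TransactionAmount'].between(low, high)]
# `sparse_output` replaced `sparse` in scikit-learn 1.2.
ohe = OneHotEncoder(sparse_output=False)
encoded = pd.DataFrame(ohe.fit_transform(df[['PaymentMethod']]),
                       columns=ohe.get_feature_names_out(), index=df.index)
X = pd.concat([df[['Age', 'TransactionAmount']], encoded], axis=1)
y = df['IsFraud']
# Split first, then oversample only the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_res, y_res = SMOTE().fit_resample(X_train, y_train)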
```
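
Step 4 also lists cost-sensitive learning as an alternative to resampling. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for real transactions and logistic regression as the model (neither is prescribed by the plan above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction data: about 5% positive ("fraud") class.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# class_weight='balanced' weights each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("recall, unweighted:", recall_plain)
print("recall, weighted:  ", recall_weighted)
```

Class weighting usually raises recall on the minority class at some cost in precision, and unlike SMOTE it does not create synthetic rows.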
