# Report on Tasks Completed

### TASK 1: MATLAB Machine Learning Onramp Course

**Approach:**  
I enrolled in the MATLAB Machine Learning Onramp course and completed the interactive lessons to learn the basics of machine learning using MATLAB. I focused on understanding how to handle data, train models, and evaluate their performance within the MATLAB environment.

**Review:**  
The course was well-organized and helped me get hands-on experience with machine learning concepts. The exercises were clear and helped me learn step-by-step.

**Difficulties:**

- Sometimes the page would reset due to internet issues, causing loss of progress.
- Getting used to MATLAB’s syntax took extra time because I had not used the language before.

**Improvements Suggested:**

- It would be helpful to have downloadable materials for offline study.
 
### Screenshot: Course Completion

![image](<Screenshot 2025-10-18 005011.png>)

---

### TASK 2: Kaggle Crafter – Build & Publish Your Own Dataset

**Approach:**  
I created a synthetic dataset of 100 student records using Python’s Faker library in Google Colab. Each record includes a student ID, name, age, gender, marks, and grade. I then uploaded the dataset to Kaggle and filled in the required metadata: description, tags, license, and file details.

**Review:**  
This task helped me understand how to prepare a dataset that is clean, well-documented, and ready to be shared publicly. I learned the importance of metadata and clear documentation to make datasets usable and trustworthy.

**Difficulties:**

- I faced a logic problem in the dataset: the marks and grades do not match correctly. For example, one student scored 100 marks but got a grade ‘C’, while another scored 78 marks but got an ‘A’. This inconsistency reduces the credibility of the dataset.
- Choosing the right tags on Kaggle was confusing because some of the tags I wanted were not available.
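
One way to rule out this kind of marks/grade mismatch is to derive the grade from the marks instead of generating the two fields separately. The sketch below uses only the standard library; the function names and the grade boundaries are hypothetical and are not the ones used in the published dataset.

```python
import random

# Hypothetical grade boundaries as (lower bound, grade), highest first.
GRADE_BOUNDARIES = [(90, "A"), (75, "B"), (60, "C"), (40, "D")]


def marks_to_grade(marks: int) -> str:
    """Return the grade for a marks value in the range 0-100."""
    if not 0 <= marks <= 100:
        raise ValueError(f"marks out of range: {marks}")
    for lower_bound, grade in GRADE_BOUNDARIES:
        if marks >= lower_bound:
            return grade
    return "F"


def make_student(student_id: int, rng: random.Random) -> dict:
    """Build one synthetic record whose grade always matches its marks."""
    marks = rng.randint(0, 100)
    return {"student_id": student_id, "marks": marks, "grade": marks_to_grade(marks)}


rng = random.Random(42)  # fixed seed so the records are reproducible
students = [make_student(i, rng) for i in range(1, 101)]
print(students[:3])
```

Because the grade is a pure function of the marks, a record such as "100 marks, grade C" cannot occur.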

**Dataset (Kaggle):** [Fake Student Information Dataset](https://www.kaggle.com/datasets/jayaramlnaik/fake-student-information-dataset)

---

### TASK 3: Data Detox - Data Cleaning using Pandas

**Approach:**  
I worked on cleaning the customer_data.csv dataset in the Copy_of_Datadetox.ipynb notebook. I started by loading the data and removing exact duplicate rows. Then, I fixed typos in categorical columns like Country, Gender, and PreferredDevice using replace() with dictionaries. I handled impossible values in Age and TotalPurchase by setting them to NaN if outside logical ranges (e.g., Age <0 or >100). I converted SignupDate and LastLogin to datetime, fixed temporal inconsistencies where LastLogin was before SignupDate, and imputed missing values: numerical columns with median, categorical with mode, and placeholders for unique fields like Email. Finally, I dropped rows with missing CustomerID or SignupDate, filtered out fake names (numeric entries), and saved the cleaned dataset as cleaned_customer_data.csv.

**Review:**  
This task was an excellent hands-on experience with real-world data cleaning challenges. I learned the importance of inspecting data thoroughly, using appropriate imputation strategies, and preserving data integrity. The step-by-step approach helped me understand Pandas methods for handling duplicates, inconsistencies, formatting, and missing values, preparing the data effectively for analysis or modeling.

**Difficulties:**  
Understanding the logic for temporal checks (e.g., LastLogin before SignupDate) took some thought, as did choosing between dropping vs. imputing missing values. Handling dates with errors='coerce' was new, and ensuring no nulls remained required careful verification.

**Improvements Suggested:**  
The dataset could include more diverse data types for broader practice. Adding automated validation checks or using libraries like Great Expectations for data quality could enhance the process. The tutorial format was helpful, but more emphasis on why certain strategies (e.g., median vs. mean) are chosen would be beneficial.
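
On the median-versus-mean question: the median is a common default for imputing columns that contain outliers, because a single extreme value can pull the mean far from the typical value while barely moving the median. A small illustration with made-up ages (the values are hypothetical, not taken from customer_data.csv):

```python
import statistics

# Hypothetical ages with one data-entry error (250).
ages = [22, 25, 27, 30, 31, 33, 250]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# The mean is dragged upward by the single bad entry; the median is not.
print(f"mean:   {mean_age:.1f}")
print(f"median: {median_age}")
```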

---

### TASK 4: Anomaly Detection

**Approach:**  
I worked on detecting anomalies in the G-Flix user activity logs using the anomaly_detection.csv dataset in the Anomaly_detection.ipynb notebook. I started by loading the data and performing basic inspection with info() and describe() to identify red flags like extreme max values. I created visualizations including box plots and histograms with KDE for login_duration_min, data_accessed_MB, and files_downloaded to understand normal behavior trends. For statistical anomaly detection, I computed Z-scores for the numerical features and flagged users whose absolute Z-score exceeded 3 on any feature as suspects. For unsupervised ML, I used Isolation Forest on scaled data to detect multivariate anomalies. I compared the results to find high-confidence suspects flagged by both methods, visualized them on a scatter plot, and prepared a final report with the top 5 suspects, ranked by data accessed and backed by the supporting evidence.

**Review:**  
This task provided practical experience in anomaly detection for security forensics. I learned to combine statistical methods (Z-score) with ML (Isolation Forest) for robust detection, and saw why scaling data and visualizing results matter. The step-by-step approach helped differentiate outliers from genuine anomalies, and building the investigative report improved my storytelling skills.

**Difficulties:**  
Choosing the right contamination parameter for Isolation Forest and interpreting multivariate anomalies was tricky. Ensuring the Z-score threshold (3) was appropriate for the dataset required understanding statistical significance.

**Improvements Suggested:**  
The dataset could include more features for deeper analysis. Adding evaluation metrics like precision or using other algorithms (e.g., DBSCAN) could enhance comparison. The tutorial style was effective, but more guidance on tuning hyperparameters would be helpful.
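
Precision and recall can only be computed when some ground truth is available. The sketch below assumes a set of confirmed anomalous user IDs exists; the report does not say whether the dataset provides one, and both the helper name and the IDs are hypothetical.

```python
def precision_recall(flagged: set, true_anomalies: set) -> tuple:
    """Return (precision, recall) of flagged IDs against known anomalies."""
    true_positives = len(flagged & true_anomalies)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_anomalies) if true_anomalies else 0.0
    return precision, recall


# Hypothetical user IDs, for illustration only.
flagged_by_model = {"u01", "u07", "u13", "u21"}
confirmed_anomalies = {"u07", "u13", "u30"}

precision, recall = precision_recall(flagged_by_model, confirmed_anomalies)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The same function can be applied to the Z-score flags, the Isolation Forest flags, and their intersection to compare the three on equal terms.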

---

### TASK 5: Logistic Regression from Scratch

**Approach:**  
I implemented logistic regression from scratch and compared it with Scikit-Learn using the framingham.csv dataset in the Logistic_Regression.ipynb notebook. I started by loading the data, exploring it with info(), checking missing values, target distribution, and a correlation heatmap. For preprocessing, I dropped rows with missing values, manually scaled features (standardization), added a bias column, and split into train/test sets. The scratch implementation included sigmoid, log loss, gradient descent for fitting, and prediction functions. I trained the model, visualized the loss curve, calculated metrics (accuracy, precision, recall, F1) from scratch, and created a confusion matrix heatmap. Finally, I implemented using Scikit-Learn, compared metrics, and visualized the differences.

**Review:**  
This task deepened my understanding of logistic regression mechanics, matrix operations, and gradient descent. Building from scratch showed the inner workings, while comparing with Scikit-Learn highlighted the benefits of abstraction. I learned how class imbalance affects the metrics, how to evaluate models properly, and why scaling and preprocessing matter.

**Difficulties:**  
Implementing gradient descent and ensuring convergence without vanishing gradients was challenging. Handling the bias term in matrix form and debugging the loss function took time. Class imbalance affected recall, requiring careful metric interpretation.

**Improvements Suggested:**  
The dataset could include more balanced classes or techniques like SMOTE. Adding regularization to the scratch model would make it more robust. The step-by-step breakdown was excellent, but more on hyperparameter tuning (e.g., learning rate) would help.
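
The regularization suggestion could look roughly like the following: an L2 penalty added to the gradient step of the scratch model. This is a sketch under stated assumptions, not the notebook's code. The function name and the `lam` parameter are hypothetical, column 0 of `X` is assumed to be the bias column, and the data is synthetic.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def fit_logistic_regression_l2(X, y, lr=0.1, iterations=1000, lam=0.1):
    """Gradient descent for logistic regression with an L2 penalty.

    Assumes column 0 of X is the bias column of ones; it is not penalized.
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(iterations):
        y_pred = sigmoid(X @ weights)
        grad = X.T @ (y_pred - y) / n_samples
        penalty = (lam / n_samples) * weights  # gradient of (lam / 2n) * ||w||^2
        penalty[0] = 0.0  # leave the bias term unregularized
        weights -= lr * (grad + penalty)
    return weights


# Synthetic check: a stronger penalty should shrink the non-bias weights.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] + 0.5 * X_raw[:, 1] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), X_raw])

w_weak = fit_logistic_regression_l2(X, y, lam=0.0)
w_strong = fit_logistic_regression_l2(X, y, lam=50.0)
print(np.linalg.norm(w_weak[1:]), np.linalg.norm(w_strong[1:]))
```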

## Task 3: Data Detox - Data Cleaning using Pandas

### Load and Explore Dataset

Load the dataset (customer_data.csv here; adjust the path if the file is stored elsewhere) and explore the types of issues present using Pandas methods like head(), info(), and describe().

In [None]:
# Import necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv('customer_data.csv')

# Check initial shape
initial_shape = df.shape
print(f"Starting point: {initial_shape[0]} rows and {initial_shape[1]} columns.")

# Explore the dataset
print("First 5 rows:")
print(df.head())

print("\nData info:")
print(df.info())

print("\nDescriptive statistics:")
print(df.describe())

# Additional exploration: Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Check for duplicates
print("\nNumber of duplicate rows:")
print(df.duplicated().sum())

# Check unique values in categorical columns
print("\nUnique Countries:", df['Country'].unique())
print("Unique Genders:", df['Gender'].unique())
print("Unique Devices:", df['PreferredDevice'].unique())

### Handle Missing Values

Handle missing values by dropping or imputing them using Pandas methods like dropna() or fillna() after inspecting their significance.

In [None]:
# Handle missing values
print("Missing values before handling:")
print(df.isnull().sum())

# Numerical imputation with median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['TotalPurchase'] = df['TotalPurchase'].fillna(df['TotalPurchase'].median())

# Categorical imputation with mode
df['Country'] = df['Country'].fillna(df['Country'].mode()[0])
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Constant imputation
df['PreferredDevice'] = df['PreferredDevice'].fillna('unknown')
df['Email'] = df['Email'].fillna('not_provided@example.com')

# Drop rows with critical missing data
df = df.dropna(subset=['CustomerID', 'SignupDate', 'LastLogin'])

print("Missing values after handling:")
print(df.isnull().sum())

### Fix Inconsistencies in Text/Categorical Columns

Fix inconsistencies such as typos in text/categorical columns using replace() with a mapping of known misspellings. String methods like str.lower() or str.replace() help with case mismatches.

In [None]:
# Fix inconsistencies in text/categorical columns
# Fixing Country names
country_fixes = {
    'Indai': 'India',
    'Canda': 'Canada'
}
df['Country'] = df['Country'].replace(country_fixes)

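# Fixing Gender typos; 'Unknown' is treated as missing
gender_fixes = {
    'Femlae': 'Female',
    'mle': 'Male',
    'Unknown': None
}
df['Gender'] = df['Gender'].replace(gender_fixes)
# Mapping 'Unknown' to None re-introduces missing values, so impute with the mode again
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])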

# Fixing Device typos
device_fixes = {
    'dasktop': 'desktop',
    'moblie': 'mobile'
}
df['PreferredDevice'] = df['PreferredDevice'].replace(device_fixes)

# Verify the fixes
print("Unique Countries after fix:", df['Country'].unique())
print("Unique Genders after fix:", df['Gender'].unique())
print("Unique Devices after fix:", df['PreferredDevice'].unique())

### Format Columns Correctly

Format columns correctly, e.g., convert dates to datetime using pd.to_datetime() and numbers to int/float using astype().

In [None]:
# Format columns correctly
import numpy as np

# Convert dates to datetime
df['SignupDate'] = pd.to_datetime(df['SignupDate'], errors='coerce')
df['LastLogin'] = pd.to_datetime(df['LastLogin'], errors='coerce')

# Handle impossible values
print("--- Before Cleaning ---")
print(f"Age Range: {df['Age'].min()} to {df['Age'].max()}")
print(f"Purchase Range: {df['TotalPurchase'].min()} to {df['TotalPurchase'].max()}")

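df.loc[(df['Age'] < 0) | (df['Age'] > 100), 'Age'] = np.nan
df.loc[df['TotalPurchase'] < 0, 'TotalPurchase'] = np.nan

# The invalid entries are now NaN; re-impute them with the median of the valid values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['TotalPurchase'] = df['TotalPurchase'].fillna(df['TotalPurchase'].median())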

print("\n--- After Cleaning ---")
print(f"New Age Range: {df['Age'].min()} to {df['Age'].max()}")
print(f"New Purchase Range: {df['TotalPurchase'].min()} to {df['TotalPurchase'].max()}")

# Fix temporal logic
time_travelers = df[df['LastLogin'] < df['SignupDate']]
print(f"Detected {len(time_travelers)} records where LastLogin is before SignupDate.")
df.loc[df['LastLogin'] < df['SignupDate'], 'LastLogin'] = pd.NaT

# Check data types
print("Data types:")
print(df.dtypes)

### Remove Duplicate Rows

Remove duplicate rows using drop_duplicates() if any are found.

In [None]:
# Remove duplicate rows
print("Number of rows before removing duplicates:", len(df))

df.drop_duplicates(inplace=True)

print("Number of rows after removing duplicates:", len(df))

### Save Cleaned Dataset

Save the cleaned dataset as a new CSV using to_csv() (replace with desired file path).

In [None]:
# Filter out fake names
bad_names_mask = df['Name'].astype(str).str.isnumeric()
df = df[~bad_names_mask]

# Save the cleaned dataset
df.to_csv('cleaned_customer_data.csv', index=False)

print("Cleaned dataset saved.")
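
A short validation pass over the cleaned frame can confirm that the rules applied in the cells above actually hold. The sketch below is one way to do it. The helper name `validate_customers` and the sample rows are hypothetical; the column names are the ones used in the notebook.

```python
import pandas as pd


def validate_customers(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df["CustomerID"].isnull().any():
        problems.append("missing CustomerID")
    if df.duplicated().any():
        problems.append("duplicate rows")
    if not df["Age"].dropna().between(0, 100).all():
        problems.append("Age outside 0-100")
    if (df["TotalPurchase"].dropna() < 0).any():
        problems.append("negative TotalPurchase")
    # Comparisons involving NaT evaluate to False, so missing dates are not counted here.
    if (df["LastLogin"] < df["SignupDate"]).any():
        problems.append("LastLogin before SignupDate")
    return problems


# Hypothetical two-row sample; the second row violates two rules on purpose.
sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "Age": [34, 130],
    "TotalPurchase": [120.5, 80.0],
    "SignupDate": pd.to_datetime(["2023-01-10", "2023-05-01"]),
    "LastLogin": pd.to_datetime(["2023-02-01", "2023-04-01"]),
})
print(validate_customers(sample))
```

Running `validate_customers(df)` just before `to_csv()` and requiring an empty list would catch any rule that a later cleaning step silently undid.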

## Task 4: Anomaly Detection

### Load and Explore Dataset

Load the dataset (anomaly_detection.csv here; adjust the path if the file is stored elsewhere) and explore it using Pandas and visualizations like histograms or scatter plots.

In [None]:
# Load the dataset for Task 4
import numpy as np
import pandas as pd

df_activity = pd.read_csv('anomaly_detection.csv')

# Explore the dataset
print("--- Forensic Evidence Overview ---")
print(df_activity.info())

print("\n--- Summary Statistics (Check the MAX values!) ---")
print(df_activity.describe())

print("\nFirst 10 rows:")
print(df_activity.head(10))


### Identify Normal Behavior Trends with Visualizations

Use Matplotlib or Seaborn to create visualizations identifying normal behavior trends in user activity logs.

In [None]:
# Identify normal behavior trends
import matplotlib.pyplot as plt
import seaborn as sns

# Set style (before creating the figure so the theme applies to it)
sns.set_theme(style="whitegrid")
plt.figure(figsize=(15, 5))

# Box plots
features = ['login_duration_min', 'data_accessed_MB', 'files_downloaded']
for i, col in enumerate(features):
    plt.subplot(1, 3, i+1)
    sns.boxplot(y=df_activity[col], color='salmon')
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

# Histograms
plt.figure(figsize=(15, 5))
for i, col in enumerate(features):
    plt.subplot(1, 3, i+1)
    sns.histplot(df_activity[col], bins=30, kde=True, color='teal')
    plt.title(f'Histogram of {col}')

plt.tight_layout()
plt.show()

### Apply Statistical Anomaly Detection (Z-score/IQR)

Apply statistical methods like Z-score or IQR to detect anomalies, flagging outliers based on thresholds.

In [None]:
# Apply statistical anomaly detection
from scipy import stats

# Select features
features_to_check = ['login_duration_min', 'data_accessed_MB', 'files_downloaded']

# Calculate Z-scores
z_scores = np.abs(stats.zscore(df_activity[features_to_check]))

# Define threshold
threshold = 3
outliers = (z_scores > threshold).any(axis=1)

# Flag suspects
df_activity['is_statistical_anomaly'] = outliers

# Extract suspects
statistical_suspects = df_activity[df_activity['is_statistical_anomaly'] == True]

print(f"Z-Score Analysis complete. Found {len(statistical_suspects)} suspects.")
print("\n--- Top Statistical Suspects ---")
print(statistical_suspects.sort_values(by='data_accessed_MB', ascending=False).head())
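
The IQR rule mentioned in this section's heading is an alternative to the Z-score. It flags values outside Tukey's fences, and because quartiles are barely affected by a few extreme values, it is less prone to having those values inflate the threshold. A minimal sketch on made-up readings; the helper name and the data are hypothetical.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical data_accessed_MB readings with one extreme value.
data_mb = [120, 135, 128, 140, 131, 126, 138, 5000]
mask = iqr_outlier_mask(data_mb)
print(np.asarray(data_mb)[mask])
```

On the real data the same mask could be computed per column, for example `iqr_outlier_mask(df_activity['data_accessed_MB'])`, and combined across features with a logical OR as the Z-score cell does.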

### Apply Unsupervised ML Anomaly Detection (Isolation Forest/DBSCAN)

Apply unsupervised ML techniques like Isolation Forest or DBSCAN from Scikit-Learn, scaling data first if needed.

In [None]:
# Apply unsupervised ML anomaly detection
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Feature scaling
features = ['login_duration_min', 'data_accessed_MB', 'files_downloaded']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_activity[features])

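# Isolation Forest
# contamination=0.02 assumes roughly 2% of users are anomalous
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
# fit_predict returns labels (1 = normal, -1 = anomaly), not continuous scores
df_activity['ml_anomaly_score'] = model.fit_predict(X_scaled)
df_activity['is_ml_anomaly'] = df_activity['ml_anomaly_score'].map({1: False, -1: True})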

# Extract suspects
ml_suspects = df_activity[df_activity['is_ml_anomaly'] == True]

print(f"ML Investigation complete. Isolation Forest flagged {len(ml_suspects)} suspects.")
print("\n--- Top ML Suspects ---")
print(ml_suspects.sort_values(by='data_accessed_MB', ascending=False).head())

### Compare Flagged Anomalies

Compare anomalies flagged by both methods using visualizations and lists.

In [None]:
# Compare flagged anomalies
# High-confidence suspects
df_activity['is_high_confidence'] = df_activity['is_statistical_anomaly'] & df_activity['is_ml_anomaly']

# Visualization
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_activity[df_activity['is_ml_anomaly'] == False],
                x='data_accessed_MB', y='files_downloaded',
                alpha=0.4, label='Normal Behavior', color='gray')
sns.scatterplot(data=df_activity[df_activity['is_ml_anomaly'] == True],
                x='data_accessed_MB', y='files_downloaded',
                color='orange', label='ML Flags', s=80)
top_5 = df_activity.sort_values(by=['is_high_confidence', 'data_accessed_MB'], ascending=False).head(5)
sns.scatterplot(data=top_5, x='data_accessed_MB', y='files_downloaded',
                color='red', marker='X', s=200, label='Top 5 Suspects')
plt.title('G-Flix Forensic Map: Spotting the Breach')
plt.legend()
plt.show()

### Prepare Final Report with Top 5 Suspects

Prepare a report listing top 5 suspects with evidence, justifying based on multiple features and context.

In [None]:
# Prepare final report
print("--- G-FLIX BOARD OF DIRECTORS: TOP 5 SUSPECTS REPORT ---")
report_cols = ['user_id', 'login_duration_min', 'data_accessed_MB', 'files_downloaded', 'remote_access']
print(top_5[report_cols])

# Additional details
for i, suspect in enumerate(top_5.itertuples(), 1):
    print(f"\nSuspect {i}: {suspect.user_id}")
    print(f"Evidence: Login {suspect.login_duration_min} min, Data {suspect.data_accessed_MB} MB, Files {suspect.files_downloaded}, Remote {suspect.remote_access}")
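    # top_5 can include rows that were not flagged by both methods, so check the flag
    if suspect.is_high_confidence:
        print("Justification: Flagged by both statistical and ML methods as high-confidence anomaly.")
    else:
        print("Justification: Not flagged by both methods; ranked by volume of data accessed.")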

## Task 5: Logistic Regression from Scratch

### Implement Logistic Regression from Scratch

Implement logistic regression from scratch using NumPy for matrix operations, gradient descent, and the sigmoid function on the heart disease dataset (framingham.csv here; adjust the path if the file is stored elsewhere).

In [None]:
# Implement Logistic Regression from Scratch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load dataset
df_heart = pd.read_csv('framingham.csv')

# Preprocessing
df_clean = df_heart.dropna()
X = df_clean.drop('TenYearCHD', axis=1).values
y = df_clean['TenYearCHD'].values

# Feature scaling
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_scaled = (X - X_mean) / X_std

# Add bias column
intercept = np.ones((X_scaled.shape[0], 1))
X_final = np.concatenate((intercept, X_scaled), axis=1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compute loss
def compute_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Fit logistic regression
def fit_logistic_regression(X, y, lr=0.1, iterations=1000):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    loss_history = []
    for i in range(iterations):
        z = np.dot(X, weights)
        y_pred = sigmoid(z)
        dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
        weights -= lr * dw
        current_loss = compute_loss(y, y_pred)
        loss_history.append(current_loss)
        if i % 200 == 0:
            print(f"Iteration {i}: Loss = {current_loss:.4f}")
    return weights, loss_history

# Predict
def predict(X, weights, threshold=0.5):
    probabilities = sigmoid(np.dot(X, weights))
    return (probabilities >= threshold).astype(int)

# Train the model
final_weights, history = fit_logistic_regression(X_train, y_train, lr=0.1, iterations=2000)

print("Training Complete!")
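
Debugging the loss and gradient is easier with a numerical gradient check: compare the analytic gradient used in the update step, X.T @ (sigmoid(X @ w) - y) / n, against central finite differences of the log loss. The sketch below runs on random data; the helper names `analytic_gradient` and `numerical_gradient` are hypothetical and are not part of the notebook.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def log_loss(weights, X, y):
    y_pred = np.clip(sigmoid(X @ weights), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))


def analytic_gradient(weights, X, y):
    """Gradient of the mean log loss, as used in the gradient descent update."""
    return X.T @ (sigmoid(X @ weights) - y) / X.shape[0]


def numerical_gradient(weights, X, y, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(weights)
    for j in range(weights.size):
        step = np.zeros_like(weights)
        step[j] = h
        grad[j] = (log_loss(weights + step, X, y) - log_loss(weights - step, X, y)) / (2 * h)
    return grad


rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # bias column + 3 features
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4)

max_diff = np.max(np.abs(analytic_gradient(w, X, y) - numerical_gradient(w, X, y)))
print(f"largest difference: {max_diff:.2e}")
```

A difference many orders of magnitude below the gradient's own size suggests the analytic gradient matches the loss; a large difference usually points to a sign error or a missing 1/n factor.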

### Implement Logistic Regression Using Scikit-Learn

Implement logistic regression using Scikit-Learn's LogisticRegression on the same dataset.

In [None]:
# Implement using Scikit-Learn
from sklearn.linear_model import LogisticRegression

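# Train model
# X_train already contains the bias column, so the built-in intercept is disabled.
# penalty=None (no regularization) requires scikit-learn 1.2 or newer.
sk_model = LogisticRegression(penalty=None, fit_intercept=False, max_iter=2000)
sk_model.fit(X_train, y_train)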

# Predict
y_pred_sk = sk_model.predict(X_test)

### Compare Models with Metrics

Compare models using metrics like accuracy, precision, recall, and F1-score from Scikit-Learn.

In [None]:
# Compare models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# For scratch model
y_pred_scratch = predict(X_test, final_weights)
acc_scratch = accuracy_score(y_test, y_pred_scratch)
prec_scratch = precision_score(y_test, y_pred_scratch)
rec_scratch = recall_score(y_test, y_pred_scratch)
f1_scratch = f1_score(y_test, y_pred_scratch)

# For sklearn
acc_sk = accuracy_score(y_test, y_pred_sk)
prec_sk = precision_score(y_test, y_pred_sk)
rec_sk = recall_score(y_test, y_pred_sk)
f1_sk = f1_score(y_test, y_pred_sk)

# Comparison table
comparison_data = {
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Scratch': [acc_scratch, prec_scratch, rec_scratch, f1_scratch],
    'Scikit-Learn': [acc_sk, prec_sk, rec_sk, f1_sk]
}
comparison_df = pd.DataFrame(comparison_data)
print("--- MODEL COMPARISON ---")
print(comparison_df)

# Visual comparison
comparison_df.set_index('Metric').plot(kind='bar', figsize=(10, 6))
plt.title("Performance Comparison: Scratch vs. Scikit-Learn")
plt.ylabel("Score")
plt.show()

### Discuss Performance Differences

Discuss performance differences, training time, implementation difficulty, and interpretability between scratch and library implementations.

**Discussion:**

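- **Performance Differences:** With regularization disabled in Scikit-Learn (penalty=None), both models minimize the same log loss, so their metrics should be close. Small gaps come from the optimizer: Scikit-Learn uses the L-BFGS solver by default, while the scratch model uses fixed-step gradient descent. Comparable scores indicate that the scratch implementation is correct.
- **Training Time:** The scratch model took longer to train because it runs 2,000 full-batch gradient descent iterations in NumPy, while Scikit-Learn's solver relies on optimized routines and typically needs fewer iterations.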
- **Implementation Difficulty:** Scratch required understanding matrix math, sigmoid, and gradient descent, making it harder but educational. Scikit-Learn abstracts this complexity.
- **Interpretability:** Both are interpretable via weights, but scratch gives more control over the process.
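
The training-time comparison can be backed by a measurement. The sketch below is a small timing helper built on `time.perf_counter`; the helper name `time_call` is hypothetical, and the workload is a stand-in rather than the actual training calls.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this would be the two training calls.
result, seconds = time_call(sum, range(1_000_000))
print(f"finished in {seconds:.4f} s")
```

In the notebook, the two calls to compare would be `time_call(fit_logistic_regression, X_train, y_train, lr=0.1, iterations=2000)` and `time_call(sk_model.fit, X_train, y_train)`.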