# Final Project: Regression Analysis - TITLE TBD

Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/applied-ml-data-git-hub <br>
19 March 2025 <br>

## Introduction

## Section 1. Import and Inspect the Data

Python libraries are collections of pre-written code that provide specific functionalities, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as os, sys, math, json, and datetime, come built-in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like pandas, numpy, matplotlib, seaborn, and scikit-learn, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

Pandas is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

NumPy (Numerical Python) is a foundational library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a comprehensive collection of mathematical functions to operate on these arrays efficiently. NumPy is a key component in scientific computing and machine learning. <br>
https://numpy.org/doc/stable/ <br>

Matplotlib is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

Seaborn is a statistical data visualization library built on top of Matplotlib, designed for creating visually appealing and informative plots. It simplifies complex visualizations, such as heatmaps, violin plots, and pair plots, making it easier to identify patterns and relationships in datasets. <br>
https://seaborn.pydata.org/ <br>

Scikit-learn provides a variety of tools for machine learning, including data preprocessing, model selection, and evaluation. It contains essential functions for building predictive models and analyzing datasets. <br>
sklearn.metrics: This module provides various performance metrics for evaluating machine learning models. <br>
https://scikit-learn.org/stable/modules/model_evaluation.html<br>

IPython.core.display is a module from the IPython library that provides tools for displaying rich output in Jupyter Notebooks, including formatted text, images, and HTML. It enhances visualization and interaction within Jupyter environments. <br>
https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.display.html <br>

In [None]:
# Data handling
import pandas as pd
import numpy as np

# Machine learning imports
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.impute import SimpleImputer  # imputation step used by the Section 5 pipelines
from sklearn.pipeline import Pipeline  # chains preprocessing and model steps (Section 5)
from sklearn.linear_model import LinearRegression, LogisticRegression  # LinearRegression is the Section 4 model
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error, mean_squared_error, r2_score, precision_score, recall_score, f1_score, silhouette_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, RFE, SelectKBest, f_classif, mutual_info_classif

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter display helpers (for VS Code); these imports alone do not change output truncation
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

### 1.1. Load the Dataset and Display the First 10 Rows.
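
The project dataset is not named yet, so the following is only a minimal sketch of this step. It assumes the data arrives as a CSV file; the in-memory text and the column names `feature_a`, `feature_b`, and `target` are hypothetical stand-ins for the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content standing in for the real dataset file (12 data rows).
lines = ["feature_a,feature_b,target"]
lines += [f"{i},{i * 2},{i * 0.5}" for i in range(12)]
csv_text = StringIO("\n".join(lines))

# With a real file, pass its path to pd.read_csv instead of csv_text.
df = pd.read_csv(csv_text)

# head(10) returns the first 10 rows without modifying df.
first_rows = df.head(10)
print(first_rows)
```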

### 1.2. Check for Missing Values and Display Summary Statistics.
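
One common way to carry out this check is sketched below with `isna().sum()` and `describe()`. The small DataFrame and its column names are made up for illustration and are not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 4.0],
        "feature_b": [10.0, np.nan, np.nan, 40.0],
        "target": [1.5, 2.5, 3.5, 4.5],
    }
)

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Summary statistics for numeric columns; "count" excludes missing values.
summary = df.describe()
print(summary)
```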

### Reflection 1: What do you notice about the dataset? Are there any data issues?

---

## Section 2. Data Exploration and Preparation

### 2.1. Explore Data Patterns and Distribution

#### 2.1.1. Create Histograms, Box Plots, and Count Plots for Categorical Variables (as Applicable)

#### 2.1.2. Identify Patterns, Outliers, and Anomalies in Feature Distributions

#### 2.1.3. Check for Class Imbalance in the Target Variable (as Applicable)

### 2.2. Handle Missing Values and Clean Data

#### 2.2.1. Impute or Drop Missing Values (as Applicable)
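
A minimal sketch of both options, assuming rows with a missing target are dropped and a numeric feature is filled with its median. The choice of median and the column names are assumptions for illustration; the right strategy depends on the actual data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one missing feature value and one missing target value.
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 5.0],
        "target": [1.0, 2.0, 3.0, np.nan],
    }
)

# Drop rows where the target is missing, since they cannot be used for training.
df = df.dropna(subset=["target"]).copy()

# Fill the remaining missing feature values with the column median.
imputer = SimpleImputer(strategy="median")
df["feature_a"] = imputer.fit_transform(df[["feature_a"]]).ravel()
print(df)
```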

#### 2.2.2. Remove or Transform Outliers (as Applicable). 
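
One possible approach, sketched here under the assumption that the 1.5 × IQR rule is an acceptable outlier definition for the feature in question. The values are invented; whether to remove, clip, or keep flagged points is a judgment call for the real data.

```python
import pandas as pd

# Hypothetical feature values with one extreme point.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="feature_a")

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the fences.
is_outlier = (values < lower) | (values > upper)

# Option 1: remove flagged points. Option 2: clip them to the fences.
removed = values[~is_outlier]
clipped = values.clip(lower=lower, upper=upper)
print(is_outlier.sum(), len(removed), clipped.max())
```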

#### 2.2.3. Convert Categorical Data to Numerical Format Using Encoding (as Applicable)
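
A small sketch of one-hot encoding with `pd.get_dummies`, assuming a nominal (unordered) category. The `region` column is hypothetical. Dropping the first level is a common choice for linear regression because it avoids perfect collinearity with the intercept.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame(
    {
        "size": [1.0, 2.0, 3.0, 4.0],
        "region": ["north", "south", "east", "south"],
    }
)

# One-hot encode "region"; drop_first=True removes the first level ("east").
encoded = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```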

### 2.3. Feature Selection and Engineering

#### 2.3.1. Create New Features (as Applicable)

#### 2.3.2. Transform or Combine Existing Features to Improve Model Performance (as Applicable)

#### 2.3.3. Scale or Normalize Data (as Applicable)

### Reflection 2: What patterns or anomalies do you see? Do any features stand out? What preprocessing steps were necessary to clean and improve the data? Did you create or modify any features to improve performance?

---

## Section 3. Feature Selection and Justification

### 3.1. Choose Features and Target

### 3.2. Define X and y
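
As a rough illustration of this step, the sketch below separates a feature matrix `X` from a target vector `y`. The column names are placeholders, not the features chosen for this project.

```python
import pandas as pd

# Hypothetical prepared dataset.
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [0.5, 0.1, 0.9, 0.3],
        "target": [2.0, 4.1, 6.2, 7.9],
    }
)

feature_cols = ["feature_a", "feature_b"]  # assumed feature selection
target_col = "target"  # assumed target name

# X is a 2-D DataFrame of inputs; y is a 1-D Series of values to predict.
X = df[feature_cols]
y = df[target_col]
print(X.shape, y.shape)
```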

### Reflection 3: Why did you choose these features? How might they impact predictions or accuracy?

---

## Section 4. Train a Model (Linear Regression)

### 4.1. Split the Data into Training and Test Sets Using `train_test_split` (or `StratifiedShuffleSplit` if Class Imbalance is an Issue)
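
A minimal sketch of an 80/20 split with `train_test_split` on synthetic data. The split ratio and `random_state` are assumptions. Stratified splitting needs discrete class labels, so for a continuous regression target it would only apply after binning the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the project's X and y.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```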

### 4.2. Train Model Using Scikit-Learn `model.fit()` Method
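
The sketch below fits `LinearRegression` on a tiny noiseless example, so the learned slope and intercept can be checked by eye. The data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

# fit() estimates the coefficients by ordinary least squares.
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)
```

In the project, `fit()` would be called on the training split only, and the held-out test split reserved for evaluation.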

### 4.3. Evaluate Performance, for Example: Regression: R^2, MAE, RMSE; Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix; Clustering: Inertia, Silhouette Score
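
For the regression case, the three metrics can be computed as sketched below. The true and predicted values are made up; RMSE is taken as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical test-set targets and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

r2 = r2_score(y_true, y_pred)  # share of variance explained
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more

print(f"R^2: {r2:.3f}, MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```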

### Reflection 4: How well did the model perform? Any surprises in the results?

---

## Section 5. Improve the Model or Try Alternates (Implement Pipelines)

### 5.1. Implement Pipeline 1: Imputer → StandardScaler → Linear Regression.
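
A sketch of this pipeline on synthetic data with a few injected missing values. The median imputation strategy and the step names are assumptions, since the heading does not specify them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; y is computed before missing values are injected into X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X[::20, 0] = np.nan  # every 20th row loses its first feature

# Steps run in order: fill missing values, standardize, then fit the model.
pipe1 = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]
)
pipe1.fit(X, y)
preds = pipe1.predict(X)
print(preds[:5])
```

Wrapping the steps in a `Pipeline` means the imputer and scaler are fitted only on the data passed to `fit()`, which helps avoid leaking test-set statistics into training.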

### 5.2. Implement Pipeline 2: Imputer → Polynomial Features (degree=3) → StandardScaler → Linear Regression.
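
A sketch of the polynomial variant, assuming a single synthetic feature with a cubic relationship to the target. The data contain no missing values, so the imputer is a no-op here and is kept only to mirror the pipeline described in the heading.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data following y = x^3 - 2x plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = X[:, 0] ** 3 - 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# degree=3 expands x into x, x^2, x^3; include_bias=False leaves the
# constant term to LinearRegression's own intercept.
pipe2 = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]
)
pipe2.fit(X, y)
print(pipe2.score(X, y))
```

With many input features, degree 3 grows the feature count quickly, so overfitting is worth checking on the test set.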

### 5.3. Compare Performance of All Models Across the Same Performance Metrics
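
One way to organize the comparison is sketched below: every model is fitted on the same training split and scored on the same test split with the same three metrics. The synthetic cubic data and the model labels are assumptions for illustration, not project results.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a cubic relationship.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(300, 1))
y = X[:, 0] ** 3 - 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "baseline": LinearRegression(),
    "pipeline_1": make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression()),
    "pipeline_2": make_pipeline(
        SimpleImputer(),
        PolynomialFeatures(degree=3, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    ),
}

# Fit each model and collect the same metrics on the shared test set.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append(
        {
            "model": name,
            "r2": r2_score(y_test, pred),
            "mae": mean_absolute_error(y_test, pred),
            "rmse": np.sqrt(mean_squared_error(y_test, pred)),
        }
    )

results = pd.DataFrame(rows).set_index("model")
print(results)
```

In this sketch the baseline and Pipeline 1 produce matching scores, because standardizing features does not change the predictions of ordinary least squares when no values are missing. Scaling tends to matter more for regularized or distance-based models.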

### Reflection 5: Which models performed better? How does scaling impact results?

---

## Section 6. Final Thoughts & Insights

### 6.1. Summarize Findings

### 6.2. Discuss Challenges Faced

### 6.3. If You Had More Time, What Would You Try Next?

### Reflection 6: What did you learn from this project?

---

### References:

