# Linear Regression Guided Project

**`FCAI-CU-Community:`**
- [Discord](https://discord.gg/UGpXXsQ2qw)
- [GitHub](https://github.com/FCAI-CU-AI-Community)
- [YouTube](https://youtube.com/@fcai-cu-ai-community?si=qaeEzzDrOnrZpeph)


# Outline:

- [Introduction](#id1)
- [Problem Statement](#id2)
- [Setting up the Environment](#id3)
- [Data](#id4)
- [Exploratory Data Analysis (EDA)](#id5)
- [Data Preprocessing](#id6)
- [Modeling](#id7)
- [Testing](#id8)
- [Evaluation](#id9)
- [Revisiting](#id10)
- [Submission](#id11)

## <a id="id1">Introduction</a>

In this guided project, we will use the `Linear Regression` algorithm to predict a student's `Performance Index` from a set of features describing their study habits and background.

## <a id="id2">Problem Statement</a>

**Why do some students perform better than others?** This question is crucial for educators and policymakers seeking to improve educational outcomes. By analyzing student performance data, we can identify factors influencing academic success and develop targeted interventions to support students at risk of falling behind.

### Details:
- **Objective**: Develop a regression model to estimate final scores.
- **Use Case**: Assist in identifying factors influencing student outcomes, potentially aiding targeted educational interventions.
- **Target Variable**: Performance Index.

## <a id="id3">Setting up the Environment</a>

### 1. Import some required libraries (NumPy, Pandas, Matplotlib, Seaborn).

In [288]:
# Write your code here





## <a id="id4">Data</a>

### 1. Loading and Exploring the Data.

#### 1.1. Read the data card from here: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data

#### 1.2. Load and view the data.

In [289]:
# Load the data


# View the first few rows of the data


#### 1.3. How many rows and columns are there in the data?

In [290]:
# print the shape of the data


#### 1.4. What are the columns in the data?

In [291]:
# print the columns of the data


#### 1.5. What are the data types of the columns, the memory usage of the data and the number of non-null values in the columns?

In [292]:
# print the data information


## <a id="id5">Exploratory Data Analysis (EDA)</a>

#### 2.1. Check for missing values.

In [293]:
# print the null count for each column


In [294]:
# print the null count for all the data


#### 2.2. Print the statistical summary of the data.

In [295]:
# describe the data


Warm-up: Can you figure out whether there are any outliers in the data?

*hint: Look at the mean and median values of the columns.*
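
The hint above can be illustrated with a minimal sketch. The numbers below are hypothetical toy values, not taken from the dataset, and the `relative_gap` measure is just one informal way to compare the two statistics: when a few extreme values are present, the mean is pulled toward them while the median barely moves.

```python
import pandas as pd

# Hypothetical toy scores; the last value is an artificial outlier.
scores = pd.Series([52, 55, 57, 58, 60, 61, 63, 250], name="toy_score")

mean_value = scores.mean()
median_value = scores.median()

# A large relative gap between mean and median hints at skew or outliers.
relative_gap = abs(mean_value - median_value) / median_value

print(f"mean={mean_value:.2f}, median={median_value:.2f}, gap={relative_gap:.1%}")
```

If the mean and median of a column are close, as you may find in the output of `describe()`, extreme outliers are less likely, although this check alone does not rule them out.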

In [296]:
# describe the categorical features


In [297]:
# Plot a histogram for the categorical features


#### 2.3. Plot the boxplot of each column.

In [298]:
# set the figure size to (10, 5)

# Set the figure title to 'Boxplot of Student Performance Data'

# create a boxplot of the data (use sns)

# show the plot


#### 2.4. Target Feature Analysis.

In [299]:
# Describe the target variable


Questions:
- What is the range of values for the target feature?
- What are the mean and median of the target feature?

In [300]:
# Plot the distribution of the target variable with kernel density estimation


Questions:
- Is the target feature normally distributed?

In [301]:
# Plot the boxplot of the target variable


Questions:
- Are there any outliers in the target feature?
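
To back up what the boxplot shows with numbers, one common convention is the 1.5 × IQR rule, which is also the default whisker length in seaborn boxplots. The sketch below assumes that rule; `iqr_outliers` is a hypothetical helper and `toy_target` holds made-up values rather than the real target column.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical toy values; 200 is placed far from the rest on purpose.
toy_target = pd.Series([10, 35, 40, 45, 50, 55, 60, 65, 70, 200])
flagged = iqr_outliers(toy_target)

print(flagged.tolist())
```

An empty result on the real target would suggest that no points fall beyond the whiskers.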

#### 2.5. Numerical Analysis. (Discrete Features)

In [302]:
# select the numerical data


# print the first few rows of the numerical data


In [303]:
# select the discrete numerical data (having less than 11 unique values)



# print the first few rows of the discrete data


In [304]:
# Print the unique values of the discrete data
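
The "fewer than 11 unique values" rule used above can be sketched on a toy frame as follows. The column names (`few_values`, `many_values`, `label`) are placeholders, not the dataset's columns, and the threshold of 11 is simply the one stated in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one discrete, one continuous, and one text column.
toy = pd.DataFrame({
    "few_values": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "many_values": [10.5, 11.2, 12.9, 13.1, 14.8, 15.3,
                    16.7, 17.0, 18.4, 19.6, 20.2, 21.9],
    "label": list("abcabcabcabc"),
})

# Keep only numeric columns, then split them by number of distinct values.
numeric = toy.select_dtypes(include="number")
unique_counts = numeric.nunique()

discrete = numeric.loc[:, unique_counts < 11]
continuous = numeric.loc[:, unique_counts >= 11]

print(list(discrete.columns), list(continuous.columns))
```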



In [305]:
# Visualize Discrete Features with their average Performance Index

# 1. Create a subplot with the appropriate rows and columns with a figure size of (15, 5)


# 2. Set the figure title to 'Average Performance Index by Discrete Features'
# 3. For each subplot, select the subplot, add an appropriate title (Avg Performance Index vs. ...) and create a barplot of the feature (use sns)

# 4. Show the plot


Questions:
- What is the relationship between the target and each of the discrete numerical features?
- Which feature do you think has the most impact on the target feature?
- Which feature do you think has the least impact on the target feature (and could be removed)?
- Do students who sleep more hours perform better?
- Do students who study more hours perform better?
- Do students who practiced more questions perform better?
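
One rough way to put numbers behind these questions is to compare the average target value across the levels of each discrete feature. The sketch below uses hypothetical column names and made-up values; the "spread" of group means is only an informal indicator of impact, not a formal test.

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders.
toy = pd.DataFrame({
    "hours_studied": [1, 1, 2, 2, 3, 3],
    "sleep_hours": [6, 8, 6, 8, 6, 8],
    "performance": [20, 22, 40, 42, 60, 62],
})

spreads = {}
for feature in ["hours_studied", "sleep_hours"]:
    # Average target value for each level of the feature.
    avg = toy.groupby(feature)["performance"].mean()
    # Gap between the best and worst group average.
    spreads[feature] = float(avg.max() - avg.min())

print(spreads)
```

A feature whose group averages barely differ is a candidate for removal, while a large spread suggests a stronger relationship with the target.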

In [306]:
# Remove the features that have the least impact on the target variable



#### 2.6. Numerical Analysis. (Continuous Features)

In [307]:
# Select the continuous numerical data




# Print the first few rows of the continuous data


In [308]:
# Plot a scatter for each feature against the other features (pairplot)


# show the plot


In [309]:
# Plot the correlation matrix of the continuous data



# show the plot


Questions:
- What is the relationship between the target and each of the continuous numerical features?
- Which feature do you think has the most impact on the target feature?
- Which feature do you think has the least impact on the target feature (and could be removed)?
- Does the previous score of the student affect the performance index? Is it increasing or decreasing?
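
The correlation matrix can also be read numerically by ranking features by their correlation with the target. The following is a minimal sketch on synthetic data: `previous_score`, `unrelated`, and `target` are hypothetical columns generated here for illustration, so the resulting values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic data: the target depends on one feature and not on the other.
previous_score = rng.uniform(40, 100, n)
unrelated = rng.normal(size=n)
target = 0.9 * previous_score + rng.normal(scale=5, size=n)

toy = pd.DataFrame({
    "previous_score": previous_score,
    "unrelated": unrelated,
    "target": target,
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["target"].drop("target").sort_values(ascending=False)
)

print(corr_with_target)
```

A positive coefficient means the target tends to increase with the feature, and a coefficient near zero suggests little linear relationship.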

#### 2.7. Categorical Analysis.

In [310]:
# select the categorical data


# Merge the categorical data with the target variable


# print the first few rows of the categorical data


In [311]:
# Plot box plots of the categorical data against the target variable


# show the plot


Questions:
- What is the relationship between the target and each of the categorical features?
- Do students who take part in an extracurricular activity perform better than those who don't?

## <a id="id6">Data Preprocessing</a>

In [312]:
# Convert the categorical data to numerical data using one-hot encoding


# print the first few rows of the data
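
As a minimal sketch of one-hot encoding, `pd.get_dummies` can be applied to a toy frame like the one below. The `activity` and `score` columns are hypothetical, and `drop_first=True` is a design choice assumed here (it drops one redundant indicator column for a two-level feature), not a requirement of this project.

```python
import pandas as pd

# Hypothetical toy frame with one Yes/No categorical column.
toy = pd.DataFrame({
    "activity": ["Yes", "No", "Yes", "No"],
    "score": [70, 65, 80, 60],
})

# One-hot encode the categorical column; keep a single 0/1 indicator.
encoded = pd.get_dummies(toy, columns=["activity"], drop_first=True, dtype=int)

print(encoded.head())
```

With two categories, a single indicator column carries all the information, which also avoids perfectly collinear inputs in a linear model.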


## <a id="id7">Modeling</a>

### 1. Splitting the data.

#### 1.1. Split the data into features and target.

In [313]:
# Select the features and put them in X

# Select the target variable and put it in y



#### 1.2. Split the data into training and testing sets.

In [314]:
# import the train_test_split function from sklearn


# Set the random state to 42

# Set the test size to 0.2


# Split the data into training and testing sets


# Print the shape of the training and testing sets



### 2. Building and Training the Model.

In [315]:
# import the linear regression model from sklearn


# Create an instance of the Linear Regression model


# Train the model on the training data


In [316]:
# Print the model parameters



In [317]:
# print the model score (R^2) on the training data


## <a id="id8">Testing</a>

In [318]:
# Predict the target variable on the testing data


# print the first few predictions


# print the first few actual values


## <a id="id9">Evaluation</a>

In [319]:
# import the mean squared error function from sklearn


# Calculate the mean squared error of the model


# print the mean squared error


In [320]:
# import the r2_score function from sklearn


# Calculate the R^2 score of the model


# print the R^2 score as a percentage
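
To make the two metrics less of a black box, the sketch below computes them by hand with NumPy on hypothetical values. The mean squared error is the average of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares, which are the same definitions used by `mean_squared_error` and `r2_score` in scikit-learn.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# Mean squared error: average squared residual.
mse = np.mean(residuals ** 2)

# R^2: share of the target's variance explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}, R^2={r2:.1%}")
```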


## <a id="id10">Revisiting</a>

Try training the model with all the features and see if the performance of the model changes *significantly*.
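
One way to structure this comparison is sketched below on synthetic data, assuming scikit-learn is installed. The features `strong_a`, `strong_b`, and `weak`, and the helper `holdout_r2`, are hypothetical and stand in for your own full and reduced feature sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic features: two matter a lot, one barely matters.
strong_a = rng.normal(size=n)
strong_b = rng.normal(size=n)
weak = rng.normal(size=n)
y = 3.0 * strong_a + 2.0 * strong_b + 0.05 * weak + rng.normal(scale=0.5, size=n)

X_all = np.column_stack([strong_a, strong_b, weak])
X_reduced = X_all[:, :2]  # drop the weak feature


def holdout_r2(X, y):
    """Fit on 80% of the rows and return R^2 on the remaining 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


r2_all = holdout_r2(X_all, y)
r2_reduced = holdout_r2(X_reduced, y)

print(f"all features: {r2_all:.4f}, reduced: {r2_reduced:.4f}")
```

Using the same `random_state` for both runs keeps the train/test split identical, so any difference in the score comes from the features rather than from the split.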

## <a id="id11">Submission</a>

1. Go to the `Code` section of the dataset page and click the `New Notebook` button.
2. Click the `File` tab in the top-left corner of the notebook.
3. Click `Import Notebook`.
4. Upload the notebook file.
5. Click the `Save Version` button, then click `Save`.
6. Wait for the notebook to be saved, then click the `Show Version` button (the number beside the `Save Version` button).
7. Click the `Go to Viewer` button.
8. Click `Share` and choose `Public`.
9. Copy the link and paste it in the `task-submission` channel.
10. You are done! 🎉
