# Analyzing Startup Success: A Multiple Linear Regression Approach Using the 50_Startups Dataset

![50 startups](https://raw.githubusercontent.com/Ebimsv/Machine_Learning_Course/refs/heads/main/pics/50_startups.png)

## Imports

In [2]:
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
import missingno as msno  
from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LinearRegression  
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  

## Load the dataset 

In [3]:
df = pd.read_csv('../../Data/50_Startups.csv')  
df.head()  

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 166597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


## Preprocessing

### 1. Check and Handle Missing Values
Before proceeding with modeling, it's important to identify and handle any missing values in the dataset.   
This step includes printing the count of missing values per column and visualizing them using a matrix plot to understand their distribution.
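
As a minimal sketch of this check, the snippet below counts missing values per column on a small made-up frame. The `toy` DataFrame and the positions of its gaps are illustrative assumptions, not the real data. The matrix plot is left commented out and assumes the `missingno` package is installed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the startup data; the gaps are placed by hand for illustration.
toy = pd.DataFrame({
    "R&D Spend": [165349.2, 166597.7, 153441.51, 144372.41],
    "Administration": [136897.8, np.nan, 101145.55, 118671.85],
    "Marketing Spend": [471784.1, 443898.53, np.nan, np.nan],
})

# Count of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Matrix plot of where the gaps sit (assumes missingno is available):
# import missingno as msno
# msno.matrix(toy)
```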

### 2. Imputation of Missing Values
To ensure the dataset is complete, missing values in the 'Administration' and 'Marketing Spend' columns are filled with their respective medians.   
The median is robust to extreme values, so the filled entries are not pulled toward outliers, although imputing a single constant does slightly reduce each column's variance.
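
Median imputation could be sketched as follows, assuming a toy frame with hand-placed gaps (the values are invented, only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values)
toy = pd.DataFrame({
    "Administration": [100.0, np.nan, 300.0, 200.0],
    "Marketing Spend": [10.0, 20.0, np.nan, 60.0],
})

# Fill each column's gaps with that column's median.
# Series.median() skips NaN values by default.
for col in ["Administration", "Marketing Spend"]:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```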

### 3. Convert Categorical Variables
The 'State' column is converted to the pandas `category` data type, which makes its fixed set of levels explicit and prepares it for encoding in the next step.
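
A short sketch of the conversion, using a toy `State` column built from the three states visible in the preview above:

```python
import pandas as pd

# Toy column with the three states shown in the dataset preview
toy = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# Convert from object (string) dtype to pandas' category dtype
toy["State"] = toy["State"].astype("category")

print(toy["State"].dtype)
print(toy["State"].cat.categories)
```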

### 4. Encoding Categorical Variables
To enable effective modeling, categorical variables are transformed into numerical format using one-hot encoding.   
This process creates binary columns for each category, allowing the regression model to utilize these features.
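
One way to do this is with `pd.get_dummies`. The sketch below is an assumption about how the encoding might be done, not necessarily the notebook's own call, and it runs on a toy frame.

```python
import pandas as pd

# Toy frame with the categorical column and the target
toy = pd.DataFrame({
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})

# One binary column per state; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["State"], dtype=int)

print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one state column so the remaining dummies are not perfectly collinear with the intercept. Whether the original analysis does this is not stated.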

### 5. Change Order of Columns
Rearranging the columns into a logical order improves the readability of the dataset.   
This step ensures that similar attributes are grouped together, making it easier to navigate the data.
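
Reordering amounts to indexing the DataFrame with the desired column list. The order below (numeric predictors, then dummy columns, then the target last) is a hypothetical choice for illustration.

```python
import pandas as pd

# Toy single-row frame with columns in an arbitrary order
toy = pd.DataFrame({
    "R&D Spend": [165349.2],
    "Profit": [192261.83],
    "State_California": [0],
    "Administration": [136897.8],
})

# Hypothetical target order: numeric predictors, dummies, then the target
new_order = ["R&D Spend", "Administration", "State_California", "Profit"]
toy = toy[new_order]

print(toy.columns.tolist())
```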

### 6. Rename Columns for Better Readability
To enhance the clarity of the dataset, specific columns are renamed. This makes it easier to interpret the data without confusion arising from spaces or lengthy names.
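
A minimal sketch of the renaming follows. It assumes the targets are the underscore names `R&D_Spend` and `Marketing_Spend` that appear later in the correlation analysis.

```python
import pandas as pd

# Empty toy frame carrying only the original column names
toy = pd.DataFrame(columns=["R&D Spend", "Administration", "Marketing Spend", "Profit"])

# Replace spaces with underscores in the two affected names
toy = toy.rename(columns={
    "R&D Spend": "R&D_Spend",
    "Marketing Spend": "Marketing_Spend",
})

print(toy.columns.tolist())
```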

### 7. Outlier Detection and Removal
Outliers can significantly skew the results of data analysis and modeling. This step detects and removes outliers in the numerical columns using the Interquartile Range (IQR) method: with IQR = Q3 - Q1, a value is treated as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (the multiplier of 1.5 is configurable). A boxplot of each relevant column is also generated to visualize its distribution and flag potential outliers.

In [4]:
import pandas as pd  
import matplotlib.pyplot as plt  
import numpy as np  

# Function to detect and remove outliers for a single column  
def detect_and_remove_outliers(df, column_name, multiplier=1.5):  
    """  
    Detect and remove outliers from a specified column in the DataFrame.  
    
    Parameters:  
    df (pd.DataFrame): The DataFrame from which to remove outliers.  
    column_name (str): The column in which to detect outliers.  
    multiplier (float): The multiplier for the IQR method to define outliers.  

    Returns:  
    pd.DataFrame: DataFrame without outliers.  
    pd.DataFrame: Outliers detected in the specified column.  
    """  
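    # Calculate quantiles for outlier detection
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1

    # Values outside these bounds count as outliers
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr

    # Split the rows into outliers and the remaining data
    is_outlier = (df[column_name] < lower_bound) | (df[column_name] > upper_bound)
    outliers = df[is_outlier]
    df_no_outliers = df[~is_outlier]

    return df_no_outliers, outliers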
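
To make the IQR rule concrete, here is a small worked example on a made-up series. The values are chosen so the bounds are easy to check by hand and are not taken from the dataset.

```python
import pandas as pd

# Made-up values with one obvious extreme point
values = pd.Series([10, 12, 13, 14, 15, 16, 100], name="Marketing_Spend")

q1 = values.quantile(0.25)   # 12.5
q3 = values.quantile(0.75)   # 15.5
iqr = q3 - q1                # 3.0

lower = q1 - 1.5 * iqr       # 8.0
upper = q3 + 1.5 * iqr       # 20.0

outliers = values[(values < lower) | (values > upper)]
kept = values[(values >= lower) & (values <= upper)]

print(outliers.tolist())     # [100]
```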

## Feature Analysis and Selection Process for Predicting Profit

### 1. Correlation Analysis on Non-Binary Columns
This section identifies non-binary columns and computes the correlation matrix.
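
A possible sketch of this step follows. Treating a column as binary when it has at most two distinct values is an assumption made here, and the frame is toy data.

```python
import pandas as pd

# Toy frame: three continuous columns and one one-hot dummy
toy = pd.DataFrame({
    "R&D_Spend": [1.0, 2.0, 3.0, 4.0],
    "Administration": [4.0, 1.0, 3.0, 2.0],
    "State_Florida": [0, 1, 0, 1],
    "Profit": [2.0, 4.0, 6.0, 8.0],
})

# Keep columns with more than two distinct values (i.e. drop the dummies)
non_binary_cols = [col for col in toy.columns if toy[col].nunique() > 2]

# Pearson correlation matrix of the remaining columns
corr_matrix = toy[non_binary_cols].corr()
print(corr_matrix)
```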

### 2. Visualize the Correlation Heatmap
This section visualizes the correlation matrix using a heatmap for better interpretation.
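
The heatmap could be drawn with `seaborn.heatmap`, as in the sketch below. The toy data, color map, and figure size are illustrative choices, not settings taken from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy numeric frame standing in for the non-binary columns
toy = pd.DataFrame({
    "R&D_Spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Administration": [4.0, 1.0, 3.0, 5.0, 2.0],
    "Profit": [2.1, 3.9, 6.2, 8.0, 9.8],
})
corr_matrix = toy.corr()

# Annotated heatmap with a fixed [-1, 1] color scale
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. In a script, calling `plt.show()` afterwards displays it.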

### 3. Interpret Correlations

**High Correlation**:
- `R&D_Spend` has a very high correlation with Profit (0.979), suggesting it is a strong predictor.
- `Marketing_Spend` also shows a notable positive correlation with Profit (0.718).

**Low Correlation**:  
- `Administration` has a low correlation with Profit (0.121), indicating it may not be a significant predictor of profit.

### 4. Feature Selection
This section selects features for modeling based on the correlation analysis, typically by choosing those with significant correlation with the target variable.
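
One simple way to turn the correlations reported above into a feature list is a threshold on the absolute correlation with `Profit`. The cutoff of 0.5 is a hypothetical value chosen for illustration.

```python
import pandas as pd

# Correlations with Profit, as reported in the interpretation above
corr_with_profit = pd.Series({
    "R&D_Spend": 0.979,
    "Marketing_Spend": 0.718,
    "Administration": 0.121,
})

threshold = 0.5  # hypothetical cutoff, not taken from the original analysis

# Keep features whose absolute correlation meets the cutoff
selected_features = corr_with_profit[corr_with_profit.abs() >= threshold].index.tolist()
print(selected_features)
```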

## Model Training 
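
A minimal training sketch using `train_test_split` and `LinearRegression` is shown below. The data is synthetic, and the 80/20 split, the random seed, and the two-feature setup are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (50 rows, two predictors)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "R&D_Spend": rng.uniform(0, 160_000, 50),
    "Marketing_Spend": rng.uniform(0, 470_000, 50),
})
y = 50_000 + 0.8 * X["R&D_Spend"] + 0.03 * X["Marketing_Spend"] + rng.normal(0, 5_000, 50)

# Hold out 20% of the rows for testing (ratio and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training rows
model = LinearRegression()
model.fit(X_train, y_train)
```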

## 2. Model Parameters  
Display the fitted model's intercept and coefficients.  
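
As an illustration of how to read these parameters, the sketch below fits a model on noiseless toy data generated from a known equation, so the recovered intercept and coefficients can be checked against it. The feature names and numbers are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["R&D_Spend", "Marketing_Spend"]  # hypothetical labels

# Noiseless toy data generated from y = 10 + 2*x1 + 0.5*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y = 10.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

model = LinearRegression().fit(X, y)

# intercept_ is the bias term; coef_ holds one weight per feature
print("Intercept:", model.intercept_)
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.3f}")
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature, with the other features held fixed.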

## Predictions and Model Evaluation  
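
The metrics imported at the top of the notebook can be computed as sketched below. The actual and predicted values are made-up numbers chosen so the results are easy to verify by hand. In practice `y_pred` would come from `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values; each prediction is off by exactly 10
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
```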

## Visualization (Optional)  
