# Lesson: Building a Simple Regression Model with train.csv

**Goal**: Predict the current price of a vehicle with a linear regression model. The lesson is framed around one feature, horsepower (hp), but the code below fits the model on all ten numeric features.

**Dataset**: The dataset has 12 variables. The features (hp and the other numeric columns) are the independent variables, and current price is the dependent variable (the target).

I will also check which feature has the strongest correlation with current price.


### 1. Import Libraries 

In [60]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score, mean_absolute_error

### 2. Load the Dataset 

In [61]:
#load the dataset 
originalDataSet = pd.read_csv('train.csv')

# View the first 5 rows (head() returns 5 rows by default)
print("First 5 rows here")
print(originalDataSet.head())

print("Last 5 rows here")
print(originalDataSet.tail())

First 5 rows here
   v.id  on road old  on road now  years      km  rating  condition  economy  \
0     1       535651       798186      3   78945       1          2       14   
1     2       591911       861056      6  117220       5          9        9   
2     3       686990       770762      2  132538       2          8       15   
3     4       573999       722381      4  101065       4          3       11   
4     5       691388       811335      6   61559       3          9       12   

   top speed  hp  torque  current price  
0        177  73     123       351318.0  
1        148  74      95       285001.5  
2        181  53      97       215386.0  
3        197  54     116       244295.5  
4        160  53     105       531114.5  
Last 5 rows here
     v.id  on road old  on road now  years      km  rating  condition  \
995   996       633238       743850      5  125092       1          6   
996   997       599626       848195      4   83370       2          9   
997   998    

### 3. Explore and Clean the Data

#### Purpose: Ensure there are no missing values that could break the model.


In [86]:
originalDataSet.info()
#check for missing values
print(originalDataSet.isnull().sum())

#Drop rows with missing values 
originalDataSet = originalDataSet.dropna()

# Drop the v.id column: it is a row identifier, not a feature
data = originalDataSet.drop("v.id", axis="columns")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   v.id           1000 non-null   int64  
 1   on road old    1000 non-null   int64  
 2   on road now    1000 non-null   int64  
 3   years          1000 non-null   int64  
 4   km             1000 non-null   int64  
 5   rating         1000 non-null   int64  
 6   condition      1000 non-null   int64  
 7   economy        1000 non-null   int64  
 8   top speed      1000 non-null   int64  
 9   hp             1000 non-null   int64  
 10  torque         1000 non-null   int64  
 11  current price  1000 non-null   float64
dtypes: float64(1), int64(11)
memory usage: 93.9 KB
v.id             0
on road old      0
on road now      0
years            0
km               0
rating           0
condition        0
economy          0
top speed        0
hp               0
torque           0
current price    
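
As a minimal sketch of what the missing-value check and `dropna()` do, the example below uses a small invented frame (the values are illustrative and not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing hp value (not from train.csv)
toy = pd.DataFrame({
    "hp": [100.0, np.nan, 300.0],
    "current price": [1000.0, 2000.0, 3000.0],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

On train.csv every column reports 1000 non-null entries, so `dropna()` leaves the data unchanged there.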

### 4. Select the Features and (current price) as Target
The ten numeric columns, (hp) among them, are the independent variables and (current price) is the dependent variable.

In [87]:
# Use all ten numeric columns (hp among them) to predict 'current price'
X = data[['hp', 'on road old', 'on road now', 'years', 'km', 'rating', 'condition', 'economy', 'top speed', 'torque']]  # Features (independent variables)
y = data[['current price']]  # Target (dependent variable)

### The Correlation Matrix

In [89]:
data.corr()

| | on road old | on road now | years | km | rating | condition | economy | top speed | hp | torque | current price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| on road old | 1.0 | 0.034113 | 0.007207 | 0.007488 | -0.050717 | -0.015682 | -0.030097 | -0.023816 | -0.049266 | 0.00895 | 0.233035 |
| on road now | 0.034113 | 1.0 | 0.004609 | -0.053202 | 0.02828 | -0.005043 | -0.01588 | 0.012699 | -0.012719 | 0.017955 | 0.282793 |
| years | 0.007207 | 0.004609 | 1.0 | -0.002089 | 0.027285 | 0.053579 | 0.05022 | 0.025148 | -0.003272 | 0.028859 | -0.011854 |
| km | 0.007488 | -0.053202 | -0.002089 | 1.0 | -0.03993 | -0.01364 | 0.03268 | 0.02645 | -0.052918 | 0.013566 | -0.935924 |
| rating | -0.050717 | 0.02828 | 0.027285 | -0.03993 | 1.0 | 0.015943 | -0.009757 | -0.042222 | -0.022623 | 0.004408 | 0.035038 |
| condition | -0.015682 | -0.005043 | 0.053579 | -0.01364 | 0.015943 | 1.0 | 0.058788 | 0.018472 | -0.071552 | 0.047805 | 0.110108 |
| economy | -0.030097 | -0.01588 | 0.05022 | 0.03268 | -0.009757 | 0.058788 | 1.0 | -0.059402 | -0.016782 | 0.041632 | -0.034711 |
| top speed | -0.023816 | 0.012699 | 0.025148 | 0.02645 | -0.042222 | 0.018472 | -0.059402 | 1.0 | 0.057827 | -0.019697 | -0.027993 |
| hp | -0.049266 | -0.012719 | -0.003272 | -0.052918 | -0.022623 | -0.071552 | -0.016782 | 0.057827 | 1.0 | -0.013817 | 0.030238 |
| torque | 0.00895 | 0.017955 | 0.028859 | 0.013566 | 0.004408 | 0.047805 | 0.041632 | -0.019697 | -0.013817 | 1.0 | -0.00229 |
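
Reading the last column of the matrix, km has by far the strongest correlation with current price (-0.935924), while hp is close to zero (0.030238). A rough sketch of how that ranking could be produced in code is shown below. It runs on a synthetic stand-in frame (column names reused, values invented) so that it is self-contained; on the lesson's data, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's data (values invented for illustration)
rng = np.random.default_rng(0)
km = rng.uniform(50_000, 150_000, size=200)
hp = rng.uniform(50, 120, size=200)
price = 700_000 - 4.0 * km + rng.normal(0, 10_000, size=200)
toy = pd.DataFrame({"km": km, "hp": hp, "current price": price})

# Correlation of each feature with the target (the target's own row is dropped)
corr_with_target = toy.corr()["current price"].drop("current price")

# Order features by absolute correlation, strongest first
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value matters here because a strong negative correlation, like the one for km, is just as informative as a strong positive one.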


Note: X is selected with double brackets [[]] so that it stays a two-dimensional DataFrame, which is the shape scikit-learn expects for the feature matrix.
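
The difference between double and single brackets can be checked directly. This is a small sketch on an invented three-row frame (not the lesson's data):

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({"hp": [73, 74, 53], "current price": [1.0, 2.0, 3.0]})

X_2d = toy[["hp"]]  # double brackets: a DataFrame with one column
x_1d = toy["hp"]    # single brackets: a one-dimensional Series

print(type(X_2d).__name__, X_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
```

The two-dimensional form is what matters when a single column is used as the feature matrix; with a list of several columns the result is always a DataFrame.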

### 5. Split Data Into Training and Testing Sets

Purpose: Train the model on 80% of the data and test it on the remaining 20%.

In [None]:
# Split data: 80% for training, 20% for testing

X_train , X_test , y_train , y_test = train_test_split(X,y, test_size=0.2, random_state=42)
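
To see what `test_size=0.2` and `random_state=42` do, here is a small sketch on invented arrays (`X_demo` and `y_demo` are placeholder names, not part of the lesson):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` is what makes the reported scores reproducible from run to run.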

### 6. Train the Regression Model
What’s happening: The model learns a linear relationship between the features in X (hp among them) and current price.


In [91]:
#Create and Train the model

model = LinearRegression()
model.fit(X_train, y_train)
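
After fitting, what the model learned is stored in `coef_` (one weight per feature) and `intercept_`. The sketch below uses invented data that follows a known linear rule, so the recovered values can be checked; `X_demo`, `y_demo`, and `demo_model` are placeholder names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data that follows y = 3*x1 - 2*x2 + 5 exactly (no noise)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept
print(demo_model.coef_)
print(demo_model.intercept_)
```

On the lesson's model, `model.coef_` would hold ten weights, in the same order as the columns of X.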

### 7. Make Predictions on test data

In [92]:
# Predict prices for the test data
y_pred = model.predict(X_test)

### 8. Evaluate Model Performance
#### Check the fit using R-squared (r2) and the Mean Absolute Error

In [93]:
# Calculate the R-squared score (coefficient of determination, not a classification accuracy)
r2 = r2_score(y_test, y_pred)
print(f"R-squared Score {r2}")
      

#Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")

R-squared Score 0.9948521913384676
Mean Absolute Error: 7423.376749273044
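
Both metrics can be reproduced by hand, which helps make clear what they measure. The following sketch uses four invented values and compares the manual formulas with scikit-learn's functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# MAE = mean of the absolute errors, in the same units as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

print(r2_manual, r2_score(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
```

Read this way, the MAE of about 7,423 reported above is the typical size of a price error in the dataset's currency units.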


### Check the Score

In [84]:
# For a regressor, model.score returns R-squared rather than a classification accuracy
score = model.score(X_test, y_test)
print(f"Model Score: {score:.4f}")

Model Score: 0.9952
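
For a scikit-learn regressor, `score` computes the same R-squared as `r2_score` applied to the model's predictions. The sketch below checks this on invented data (`demo_model` and the arrays are placeholders). The small gap between 0.9952 here and the R-squared printed in step 8 is therefore most likely a leftover from running the cells at different times, as the cell numbers suggest, although that cannot be confirmed from the output alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Invented noisy linear data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=40)

demo_model = LinearRegression().fit(X_demo, y_demo)

score_via_method = demo_model.score(X_demo, y_demo)
score_via_metric = r2_score(y_demo, demo_model.predict(X_demo))
print(score_via_method, score_via_metric)
```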


### 9. Visualize the Regression Line
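
A single regression line can only be drawn when there is one feature. With several features, one common alternative is to plot predicted values against actual values; this is a swapped-in technique rather than the plot the lesson originally intended. The sketch below uses invented data and placeholder names (`X_demo`, `y_demo`, `demo_model`):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data with three features, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(60, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=60)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_demo_pred = demo_model.predict(X_demo)

fig, ax = plt.subplots()
# Each point is one sample: x is the actual value, y is the model's prediction
ax.scatter(y_demo, y_demo_pred, color="blue", label="Predictions")
# Points on this diagonal would be predicted perfectly
limits = [y_demo.min(), y_demo.max()]
ax.plot(limits, limits, color="red", label="Perfect prediction")
ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
ax.legend()
plt.show()
```

The closer the points sit to the diagonal, the better the fit. For the lesson's model, `y_test` and `y_pred` would take the place of `y_demo` and `y_demo_pred`.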

###  Key Takeaways 
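1. Simple linear regression models the relationship between one feature and a target; with several features, as in the code above, it is called multiple linear regression
2. Evaluation Metrics:
    * R-squared measures how much of the variance in the target the model explains
    * MAE gives the average absolute prediction error, in the same units as the target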


# Full Code Without the Step-by-Step Explanations

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('train.csv')


# Select features and target
X = data[['hp', 'on road old', 'on road now', 'years', 'km', 'rating', 'condition', 'economy', 'top speed', 'torque']]  # Features (independent variables)
y = data['current price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred)}")
print(f"MAE: {mean_absolute_error(y_test, y_pred)}")

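# Plot predicted vs. actual prices on the test set
# (X has ten columns, so a single regression line against X cannot be drawn)
plt.scatter(y_test, y_pred, color='blue', label='Test Predictions')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', label='Perfect Prediction')
plt.xlabel('Actual Current Price')
plt.ylabel('Predicted Current Price')
plt.legend()
plt.show()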

R-squared: 0.9951798797617527
MAE: 7569.793607126303
