In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('datasets/cleaned_titanic.csv')

## Cleaning Columns for Regression
### What I Did

- Selected only the columns needed for regression.
- Dropped rows with missing Age or Fare.
- Kept numeric variables only.

In [None]:
df_reg = df[['Age', 'Pclass', 'Fare']].dropna()
df_reg.shape
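
As a rough illustration of why the columns are selected before calling `dropna()`, the sketch below uses a small made-up frame (the rows and the `Cabin` column are assumptions for illustration, not values from `cleaned_titanic.csv`). Selecting first means a missing value in an unused column does not cost a row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for cleaned_titanic.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Pclass": [3, 1, 1, 2],
    "Fare": [7.25, 71.28, np.nan, 26.0],
    "Cabin": [None, "C85", "C123", None],  # not used in the regression
})

# Count missing values in the regression columns before dropping anything
missing_per_column = toy[["Age", "Pclass", "Fare"]].isna().sum()

# Select first, then drop: missing Cabin values do not remove any rows
toy_reg = toy[["Age", "Pclass", "Fare"]].dropna()

# Dropping on the full frame would also discard rows with a missing Cabin
rows_if_dropped_on_everything = toy.dropna().shape[0]

print(missing_per_column)
print(toy_reg.shape, rows_if_dropped_on_everything)
```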

## Splitting Data into Train and Test
### What I Did

- Used train_test_split() to create an 80/20 split.
- Separated features (Age, Pclass) and target (Fare).

In [None]:
X = df_reg[['Age', 'Pclass']]
y = df_reg['Fare']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape
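
The following sketch shows what `test_size=0.2` and `random_state=42` do, using synthetic arrays rather than the Titanic data. The 714-row size is chosen to match the row count reported later in this notebook; the array contents are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 714 rows, 2 feature columns (values are placeholders)
X_demo = np.arange(714 * 2).reshape(714, 2)
y_demo = np.arange(714)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te2))
```

Fixing `random_state` is what makes the reported score reproducible from run to run; a different seed would give a different test set and a somewhat different R².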

## Training the Linear Regression Model
### What I Did

- Created a LinearRegression() model object.
- Fit it on the training data.
- Checked coefficients learned by the model.

In [63]:
model = LinearRegression()
model.fit(X_train,y_train)

model.intercept_, model.coef_


## Evaluating the Model
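### What I Did

- Calculated the R² score on the test set with model.score().
- R² shows how much of the variation in Fare the model explains: 1 is a perfect fit, 0 is no better than always predicting the mean Fare, and on a test set it can even be negative.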

In [32]:
model.score(X_test, y_test)

0.1917633318253047
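
For regressors, `model.score()` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes that quantity by hand on made-up numbers (the function name `r2_score_manual` and the values are illustrative only) to show where the reference points 1 and 0 come from, and that a model worse than the mean gets a negative score.

```python
def r2_score_manual(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r2_score_manual(y_true, y_true)                       # predictions match exactly
mean_only = r2_score_manual(y_true, [25.0, 25.0, 25.0, 25.0])   # always predict the mean
worse = r2_score_manual(y_true, [40.0, 30.0, 20.0, 10.0])       # predictions reversed

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```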

### Insights

- Fare is difficult to predict from Age and Pclass alone, so a low R² is expected.
- The model explains about 19% of the variance in passenger fares on the Titanic.
- The model is therefore weak and cannot predict fares accurately from Age and Pclass alone.
- The remaining 81% of the variation in fares is not explained by this model; it comes from factors the model does not include, or from noise.

### Including Other Factors to Improve the Model's Predictive Power
- Sex
- SibSp
- Embarked

Sex and Embarked are categorical, so they must be encoded as numbers before fitting. Sex is converted with a binary mapping, and Embarked is one-hot encoded.

In [57]:
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})


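**Why drop the first category (drop_first=True)?**

Without dropping one reference category, the dummy columns always sum to 1, which duplicates the intercept and creates perfect multicollinearity (the dummy variable trap). The dropped category (here Embarked_C) becomes the baseline against which the Embarked_Q and Embarked_S coefficients are measured.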

In [58]:
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
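
To make the multicollinearity point concrete, here is a minimal sketch on a made-up `Embarked` column (six hypothetical passengers; `design_matrix` is a helper invented for this example). It compares the rank of the design matrix, intercept included, with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical ports for six passengers (C, Q and S are the Titanic port codes)
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C", "Q"]})


def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    values = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(values)), values])


full = design_matrix(pd.get_dummies(ports, columns=["Embarked"]))
reduced = design_matrix(pd.get_dummies(ports, columns=["Embarked"], drop_first=True))

# full: 4 columns but rank 3, because the three dummies sum to the intercept column
print(full.shape[1], np.linalg.matrix_rank(full))
# reduced: 3 columns and rank 3, so no column is redundant
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

When the matrix is rank-deficient, the individual coefficients are not uniquely determined, which is why one category is dropped before interpreting them.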


In [59]:
df_reg = df[['Age', 'Pclass', 'Fare', 'Embarked_Q', 'Sex', 'Embarked_S', 'SibSp']].dropna()
df_reg.shape

(714, 7)

In [60]:
X = df_reg[['Age', 'Pclass', 'Embarked_Q', 'Embarked_S', 'Sex', 'SibSp']]
y = df_reg['Fare']

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

((571, 6), (143, 6))

In [62]:
model = LinearRegression()
model.fit(X_train, y_train)

model.intercept_, model.coef_

(np.float64(128.0363040863641),
 array([ -0.15353665, -34.02658916, -15.4528103 , -15.42227514,
         -8.46885428,  10.54376764]))

### Interpretation

- Intercept: 128.0363
- Coefficients:
  - Age: -0.1535
  - Pclass: -34.0266
  - Embarked_Q: -15.4528
  - Embarked_S: -15.4223
  - Sex: -8.4689
  - SibSp: 10.5438

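Linear regression coefficients tell us how much the predicted target (Fare) changes when a feature increases by one unit, holding all other features constant.

- Positive coefficient: the predicted Fare increases as the feature increases.
- Negative coefficient: the predicted Fare decreases as the feature increases.
- Larger absolute value: larger change in Fare per one-unit increase. Features measured on different scales (Age in years, Sex as 0 or 1) are not directly comparable on this basis.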
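
As a worked example of "holding all other features constant", the sketch below plugs the rounded coefficients from the Interpretation list into the regression equation. The passenger is hypothetical, and `predict_fare` is a helper written for this example, not part of scikit-learn.

```python
# Rounded values copied from the Interpretation list above
intercept = 128.0363
coefficients = {
    "Age": -0.1535,
    "Pclass": -34.0266,
    "Embarked_Q": -15.4528,
    "Embarked_S": -15.4223,
    "Sex": -8.4689,
    "SibSp": 10.5438,
}


def predict_fare(passenger):
    """Intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * passenger[name] for name in coefficients)


# Hypothetical passenger: 30-year-old male, 1st class, boarded at S, no siblings/spouse
base = {"Age": 30, "Pclass": 1, "Embarked_Q": 0, "Embarked_S": 1, "Sex": 1, "SibSp": 0}
second_class = {**base, "Pclass": 2}  # same passenger, one class lower

fare_base = predict_fare(base)
fare_second = predict_fare(second_class)
difference = fare_second - fare_base  # equals the Pclass coefficient, up to float rounding

print(round(fare_base, 2), round(fare_second, 2), round(difference, 4))
```

Changing only Pclass from 1 to 2 moves the prediction by exactly the Pclass coefficient (about -34.03), which is what the one-unit interpretation means in practice.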
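
### Strength of Relationships

Ranked by the absolute value of each coefficient (change in Fare per one-unit increase):

- Strong: Pclass (34.03)
- Moderate: Embarked_Q (15.45), Embarked_S (15.42), SibSp (10.54)
- Weak to moderate: Sex (8.47)
- Very weak: Age (0.15)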
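
Raw coefficient sizes depend on each feature's units, so one common cross-check is to refit on standardized features and compare those coefficients instead. The sketch below does this on synthetic data (the generating formula and all values are assumptions for illustration, not Titanic results) to show that the ranking can change once scale is taken into account.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: Age spans a wide range, Pclass only takes the values 1, 2, 3
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(1, 80, n)
pclass = rng.integers(1, 4, n)
fare = 100 - 0.5 * age - 10 * pclass + rng.normal(0, 5, n)  # made-up relationship

X = np.column_stack([age, pclass])

# Coefficients in original units: change in fare per year / per class step
raw = LinearRegression().fit(X, fare)

# Coefficients after standardizing: change in fare per one standard deviation
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), fare)

print("raw:", raw.coef_)
print("standardized:", scaled.coef_)
```

In this synthetic case the class feature has the larger raw coefficient, while age has the larger standardized one, because one standard deviation of age covers many years. Running the same comparison on `X_train` would show whether the ranking above holds once scale is accounted for.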

### Insights

- Ticket class is the dominant driver of Fare: each step from 1st to 2nd to 3rd class lowers the predicted fare by about 34.
- Embarkation port and the number of siblings or spouses aboard (SibSp) meaningfully influence the price paid.
- Age has almost no direct predictive power for Fare.
- Sex has a noticeable but smaller effect compared to the other predictors.

**What happens after adding more features?**

With only Age and Pclass, the model's R² was 0.1917, meaning it captured only about 19% of the variance in Titanic fares, which was expected with so few features. After including Sex, Embarked, and SibSp, the model's predictions improved significantly, because the added predictors give more information about each passenger's profile and let the model capture more patterns. Because linear regression works only on numerical features, Sex and Embarked had to be encoded before fitting the model.
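
One way to quantify the improvement is to score both feature sets on the same held-out split. The sketch below does this on synthetic data; `holdout_r2` is a hypothetical helper and the generated numbers are not Titanic results. The same helper could be called on `df_reg` with the two feature lists used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def holdout_r2(frame, features, target="Fare"):
    """Fit on an 80/20 split (random_state=42) and return the test-set R²."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[features], frame[target], test_size=0.2, random_state=42
    )
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)


# Synthetic stand-in data: Fare depends on Pclass and SibSp, not on Age
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Pclass": rng.integers(1, 4, n),
    "SibSp": rng.integers(0, 4, n),
})
demo["Fare"] = 90 - 25 * demo["Pclass"] + 10 * demo["SibSp"] + rng.normal(0, 10, n)

r2_small = holdout_r2(demo, ["Age", "Pclass"])
r2_large = holdout_r2(demo, ["Age", "Pclass", "SibSp"])

print(f"baseline R²: {r2_small:.3f}, expanded R²: {r2_large:.3f}")
```

Because both models are scored on the same test rows, the difference between the two numbers reflects the added features rather than a different split.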