# 🧠 Girlfriend Prediction Using Logistic Regression (Sklearn)

## Project Overview
This notebook implements a **Logistic Regression classifier using scikit-learn** to predict whether an Indian boy has a girlfriend from six personal attributes such as age, income, and confidence.

### Key Features:
- 🔧 Uses sklearn's `Pipeline` for clean workflow
- 📊 Automatic feature scaling with `StandardScaler`
- 🎯 Binary classification with `LogisticRegression`
- ⚡ Quick and efficient training

---

## 1. Import Libraries
We import essential libraries:
- **pandas**: Data manipulation and loading
- **numpy**: Array operations
- **sklearn**: Machine learning tools (train_test_split, Pipeline, LogisticRegression, StandardScaler)

In [44]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

## 2. Load Dataset

We load the `indian_boys_gf_prediction_balanced.csv` dataset containing 300 balanced samples.

### Dataset Features:
| Feature | Description | Range |
|---------|-------------|-------|
| `age` | Age in years | 18-30 |
| `height_cm` | Height in centimeters | 155-190 |
| `income_lpa` | Annual income in Lakhs Per Annum | 1.5-20 |
| `fitness_level` | Self-rated fitness score | 1-10 |
| `confidence` | Self-rated confidence score | 1-10 |
| `social_media_hours` | Daily social media usage | 0.5-6.0 |
| **`has_gf`** | Target variable (0 = No, 1 = Yes) | 0/1 |

In [45]:
df = pd.read_csv("indian_boys_gf_prediction_balanced.csv")
df.head()

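|   | age | height_cm | income_lpa | fitness_level | confidence | social_media_hours | has_gf |
|---|-----|-----------|------------|---------------|------------|--------------------|--------|
| 0 | 23 | 164 | 8.8 | 1 | 2 | 5.0 | 1 |
| 1 | 27 | 158 | 18.33 | 4 | 1 | 1.0 | 0 |
| 2 | 21 | 160 | 11.69 | 1 | 10 | 1.2 | 1 |
| 3 | 28 | 158 | 12.18 | 7 | 1 | 5.9 | 1 |
| 4 | 20 | 173 | 9.25 | 9 | 2 | 3.6 | 0 |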


## 3. Prepare Features and Target

We separate the dataset into:
- **X**: Feature matrix (all columns except `has_gf`)
- **y**: Target vector (`has_gf` column)

In [46]:
X = df.drop(columns=["has_gf"])
y = df["has_gf"]

## 4. Train-Test Split

We split the dataset into:
- **Training set (70%)**: Used to train the model
- **Test set (30%)**: Used to evaluate model performance

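The splits are then converted to NumPy arrays. This is optional, since scikit-learn accepts DataFrames directly, but it means the model is trained without feature names. Because no `random_state` is passed to `train_test_split`, the split and every score below change from run to run.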

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

print("The shape of X_train is: ", X_train.shape)
print("The shape of y_train is: ", y_train.shape)
print("The shape of X_test is: ", X_test.shape)
print("The shape of y_test is: ", y_test.shape)

The shape of X_train is:  (210, 6)
The shape of y_train is:  (210,)
The shape of X_test is:  (90, 6)
The shape of y_test is:  (90,)
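
The split above is unseeded, so repeated runs produce different training and test sets. A minimal sketch of one way to make it repeatable and class-balanced is shown here, using `random_state` and `stratify`. It assumes synthetic stand-in data: `X_demo`, `y_demo`, and the helper `split` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 300 rows, 6 features, perfectly balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = np.repeat([0, 1], 150)

def split(seed):
    # stratify=y_demo keeps the 50/50 class ratio in both subsets.
    return train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed, stratify=y_demo
    )

X_tr_a, X_te_a, y_tr_a, y_te_a = split(seed=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(seed=42)

# Same seed, same split; the sizes match the notebook's 210/90 division.
print(X_tr_a.shape, X_te_a.shape)
print(np.array_equal(X_te_a, X_te_b))
print(y_te_a.mean())
```

The seed value 42 is arbitrary; any fixed integer gives a repeatable split.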


## 5. Create ML Pipeline

We create a scikit-learn `Pipeline` that combines:

1. **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance
   $$z = \frac{x - \mu}{\sigma}$$

2. **LogisticRegression**: Binary classifier using the sigmoid function
   $$P(y=1|x) = \frac{1}{1 + e^{-(w^Tx + b)}}$$

> **Why use a Pipeline?** It applies the same preprocessing at training and prediction time. Because the scaler is fitted only on the data passed to `fit`, test-set statistics never leak into training.

In [48]:
model_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression())
    ]
)
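
The two formulas above can be checked numerically. The sketch below assumes a tiny made-up height column and an illustrative weight and bias (`x`, `w`, and `b` are hypothetical values, not taken from the trained model). It compares a hand-computed z-score with `StandardScaler` and evaluates the sigmoid directly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column (heights in cm), for illustration only.
x = np.array([[158.0], [160.0], [164.0], [173.0]])

# Manual z-score; StandardScaler uses the population std (ddof=0), like np.std.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_sklearn))

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weight and bias applied to the standardized feature.
w, b = 0.8, 0.0
p = sigmoid(w * z_manual + b)
print(p.ravel())

# A linear score of exactly 0 sits on the decision boundary.
print(sigmoid(0.0))
```

A score of zero maps to a probability of 0.5, which is why the sign of $w^Tx + b$ decides the predicted class.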

## 6. Train the Model

We fit the pipeline on the training data. This will:
1. Fit the `StandardScaler` on `X_train` (learn mean and std)
2. Transform `X_train` using the learned statistics
3. Fit the `LogisticRegression` on the scaled features

In [49]:
model_pipe.fit(X_train, y_train)

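`Pipeline(steps=[('scaler', StandardScaler()), ('logreg', LogisticRegression())])`

Parameters shown in the output (all defaults):

| Step | Parameters |
|------|------------|
| `Pipeline` | `transform_input=None`, `memory=None`, `verbose=False` |
| `StandardScaler` | `copy=True`, `with_mean=True`, `with_std=True` |
| `LogisticRegression` | `penalty='l2'`, `dual=False`, `tol=0.0001`, `C=1.0`, `fit_intercept=True`, `intercept_scaling=1`, `class_weight=None`, `random_state=None`, `solver='lbfgs'`, `max_iter=100` |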
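
The three fitting steps listed in this section can be reproduced by hand to confirm what the pipeline does internally. This is a sketch under the assumption of synthetic stand-in data (`X_demo` and `y_demo` are hypothetical), using the same step names as the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in training data, not the notebook's dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 6))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 5.0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X_demo, y_demo)

# The same three steps done by hand.
scaler = StandardScaler().fit(X_demo)                 # 1. learn mean and std
X_scaled = scaler.transform(X_demo)                   # 2. scale the training data
logreg = LogisticRegression().fit(X_scaled, y_demo)   # 3. fit the classifier

# Fitted steps are reachable by name through `named_steps`.
print(np.allclose(pipe.named_steps["scaler"].mean_, scaler.mean_))
print(np.allclose(pipe.named_steps["logreg"].coef_, logreg.coef_))
```

`named_steps` is also a convenient way to inspect the learned coefficients of a fitted pipeline.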


## 7. Evaluate Model Performance

We evaluate the model on the test set using the `.score()` method, which returns the accuracy (proportion of correct predictions).

> The pipeline automatically scales the test data using the **training set statistics** before making predictions.

In [50]:
model_pipe.score(X_test, y_test)

0.5777777777777777
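
A single 90-sample split gives a noisy accuracy estimate, so it can help to compare a cross-validated score against a trivial baseline. The sketch below is one possible approach, not what the notebook does. It assumes synthetic stand-in data whose labels are unrelated to the features (`X_demo` and `y_demo` are hypothetical), so both scores should sit near chance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: random features, balanced labels with no signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = np.repeat([0, 1], 150)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])

# For classifiers, an integer cv uses stratified folds; the scaler is
# refitted on the training part of every fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

# Baseline that always predicts the most frequent training class.
baseline = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5
)

print("model accuracy per fold:", scores)
print("model mean / std:", scores.mean(), scores.std())
print("baseline mean:", baseline.mean())
```

If a model's cross-validated mean is not clearly above the baseline, a single test score like the one above should be read with caution.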

## 8. Prepare New User Input

Let's test our model with a hypothetical user:
- **Age**: 25
- **Height**: 160 cm
- **Income**: 20 LPA
- **Fitness**: 10/10
- **Confidence**: 7/10
- **Social Media**: 5 hours/day

In [51]:
user_input = pd.DataFrame([{
    "age": 25,
    "height_cm": 160,
    "income_lpa": 20,
    "fitness_level": 10,
    "confidence": 7,
    "social_media_hours": 5
}])
# The model was trained on plain arrays, so the values must follow the training column order.
user_input = user_input.to_numpy()
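
Converting to a bare array discards the column names, so the values are matched to the model purely by position. A minimal sketch of one way to guard against a mis-ordered input follows. `FEATURE_ORDER` and `raw` are hypothetical names, and the order is assumed from the `df.head()` output in section 2.

```python
import pandas as pd

# Column order assumed from the df.head() output in section 2.
FEATURE_ORDER = [
    "age", "height_cm", "income_lpa",
    "fitness_level", "confidence", "social_media_hours",
]

# Keys deliberately given in a different order than the training columns.
raw = {
    "social_media_hours": 5,
    "confidence": 7,
    "fitness_level": 10,
    "income_lpa": 20,
    "height_cm": 160,
    "age": 25,
}

# Selecting columns by name restores the training order before the
# labels are dropped by the conversion to a NumPy array.
user_row = pd.DataFrame([raw])[FEATURE_ORDER].to_numpy()
print(user_row)
```

Selecting by name also raises a `KeyError` if a feature is missing, which is safer than silently shifting the values.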

## 9. Make Prediction

We use `predict_proba()` to get the probability of each class:
- `predict_proba(X)[0][0]` = Probability of `has_gf = 0` (No GF)
- `predict_proba(X)[0][1]` = Probability of `has_gf = 1` (Has GF)

The pipeline automatically handles feature scaling for new inputs.

In [52]:
print("Probability of getting a gf is: ", model_pipe.predict_proba(user_input)[0][1])

Probability of getting a gf is:  0.7447408746039609
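
The column layout of `predict_proba` follows the fitted classifier's `classes_` attribute. The sketch below assumes synthetic stand-in data (`X_demo` and `y_demo` are hypothetical). It shows that each row sums to 1 and that `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with a signal in the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X_demo, y_demo)

proba = pipe.predict_proba(X_demo[:5])
classes = pipe.named_steps["logreg"].classes_

# Column j of `proba` is the probability of class `classes[j]`.
print("classes:", classes)
print("row sums:", proba.sum(axis=1))

# For a binary problem, predict() amounts to a 0.5 threshold on P(class 1).
manual = classes[np.argmax(proba, axis=1)]
print(np.array_equal(manual, pipe.predict(X_demo[:5])))
```

Reading the positive-class column through `classes_` avoids hard-coding the index `1`.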


---

# 📊 Conclusion

## Summary
In this notebook, we implemented a **Logistic Regression classifier using scikit-learn's Pipeline** to predict relationship status based on personal attributes.

## Key Takeaways

### 🔧 What We Used:
1. **StandardScaler** for automatic feature standardization
2. **LogisticRegression** for binary classification
3. **Pipeline** for a clean workflow with identical preprocessing at training and prediction time

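### 📈 Model Performance:
- Test accuracy is about 58%, only modestly above the 50% chance level for a balanced dataset. With 90 test samples and a single unseeded split, that margin is small enough to be sampling noise. Likely reasons for the weak signal:
  - The inherently random/personal nature of relationships
  - Limited feature set
  - Small dataset (300 samples)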
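
To gauge how much weight an accuracy of about 58% on 90 test samples can bear, a rough normal-approximation interval is a useful back-of-the-envelope check. The sketch assumes the reported score corresponds to 52 correct predictions out of 90 (52/90 = 0.5777...) and treats test predictions as independent.

```python
import math

n_test = 90          # test-set size from section 4
accuracy = 52 / 90   # 0.5777..., the score reported in section 7

# Binomial standard error of an accuracy estimated from n_test samples.
se = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval under a normal approximation.
low, high = accuracy - 1.96 * se, accuracy + 1.96 * se
print(f"accuracy = {accuracy:.3f}, standard error = {se:.3f}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

The interval runs from roughly 0.48 to 0.68 and includes 0.5, so this single score does not clearly separate the model from chance.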

### 💡 Advantages of Using sklearn:
- **Less code**: Pipeline handles scaling and training in one step
- **No data leakage**: Test data is always scaled using training statistics
- **Easy predictions**: New data is automatically preprocessed
- **Built-in regularization**: L2 regularization by default
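
The strength of the default L2 penalty is controlled by `C`, the inverse regularization strength. As a sketch, assuming synthetic stand-in data (`X_demo`, `y_demo`, and the helper `coef_norm` are hypothetical), the following shows how smaller values of `C` shrink the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with signal in the first two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=300) > 0).astype(int)

def coef_norm(C):
    """Fit the pipeline with the given C and return the coefficient L2 norm."""
    model = Pipeline(
        [("scaler", StandardScaler()), ("logreg", LogisticRegression(C=C))]
    )
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.named_steps["logreg"].coef_))

# Smaller C means a stronger penalty and smaller coefficients.
for C in (0.01, 1.0, 100.0):
    print(f"C={C:>6}: ||w|| = {coef_norm(C):.3f}")
```

On a small dataset like this one, trying a few values of `C` under cross-validation is a common way to pick the penalty strength.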

## Comparison with Manual Implementation
| Aspect | Manual (gf_pred_manual.ipynb) | Sklearn (This Notebook) |
|--------|-------------------------------|-------------------------|
| Code complexity | Higher | Lower |
| Educational value | Higher | Lower |
| Flexibility | Full control | Limited to sklearn API |
| Performance | Similar | Similar (with L2 regularization) |

---

### ⚠️ Disclaimer
*This is a fun educational project to learn logistic regression. Relationship outcomes depend on countless factors beyond what any model can capture!*