# 🟧 The Orange Problem: Complete End-to-End Data Analysis and Linear Regression
### Dataset: Loan Prediction Dataset (train.csv)

This notebook performs all steps required for *The Orange Problem*, including data cleaning, statistical analysis, visualization, confidence intervals, hypothesis testing, and linear regression modeling.


In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('train.csv')
df.head()

## 1️⃣ Data Cleaning
Handle missing values appropriately for each column type.

In [None]:
# Check missing values
print('Missing values before cleaning:')
print(df.isnull().sum())

# Fill missing values: mode for categorical/discrete columns, median for LoanAmount.
# The result is assigned back because calling fillna(inplace=True) on a selected
# column triggers a chained-assignment warning in recent pandas 2.x releases and
# does not update df when copy-on-write is enabled.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

print('\nMissing values after cleaning:')
print(df.isnull().sum())
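
The imputation rule used here (mode for categorical columns, median for a numeric column that may contain outliers) can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from train.csv
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', np.nan, 'Male'],
    'LoanAmount': [100.0, np.nan, 120.0, 500.0],
})

# Mode imputation for the categorical column
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode()[0])

# Median imputation for the numeric column: the median of the observed
# values (120) is less affected by the outlier 500 than the mean (240)
toy['LoanAmount'] = toy['LoanAmount'].fillna(toy['LoanAmount'].median())

print(toy)
```

The median is preferred for `LoanAmount` in this sketch because a single large value shifts the mean much more than the median.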

## 2️⃣ Descriptive Statistics
Generate summary statistics for numerical and categorical variables.

In [None]:
# Descriptive statistics for numerical features
df.describe()

In [None]:
# Frequency distribution for categorical variables
for col in ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']:
    print(f'\n{col} value counts:')
    print(df[col].value_counts())

## 3️⃣ Data Visualization
Visualize distributions, relationships, and correlations.

In [None]:
# Histograms for numeric variables
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].hist(bins=20, figsize=(10,6))
plt.suptitle('Distributions of Key Numeric Variables')
plt.show()

# Boxplot for LoanAmount
sns.boxplot(x=df['LoanAmount'])
plt.title('Loan Amount Boxplot')
plt.show()

# Scatter plot - ApplicantIncome vs LoanAmount
sns.scatterplot(x='ApplicantIncome', y='LoanAmount', hue='Loan_Status', data=df)
plt.title('Applicant Income vs Loan Amount')
plt.show()

# Correlation Heatmap
plt.figure(figsize=(8,6))
# Restrict to numeric columns; df.corr() raises on string columns in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
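
The points drawn beyond the whiskers in the boxplot follow the usual 1.5 × IQR rule (seaborn's default whisker length). A minimal sketch of that rule, using invented loan amounts rather than the real column:

```python
import numpy as np

# Hypothetical loan amounts, invented for illustration
amounts = np.array([90, 100, 110, 120, 130, 140, 150, 600], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]

print(f'Q1 = {q1}, Q3 = {q3}, IQR = {iqr}')
print(f'Bounds: ({lower}, {upper})')
print('Outliers:', outliers)
```

The same bounds could be applied to `df['LoanAmount']` to count how many loans the boxplot marks as outliers.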

## 4️⃣ Normality Check (LoanAmount)
Use a Q-Q plot to visually inspect whether LoanAmount follows a normal distribution.

In [None]:
stats.probplot(df['LoanAmount'], dist='norm', plot=plt)
plt.title('Q-Q Plot for Loan Amount')
plt.show()
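
A Q-Q plot is a visual check only. As a possible complement (not part of the original analysis), the Shapiro-Wilk test gives a numeric result. The sketch below runs it on synthetic samples, which are assumptions made for illustration; a small p-value suggests a departure from normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration only (not the loan data)
normal_sample = rng.normal(loc=140, scale=30, size=200)
skewed_sample = rng.lognormal(mean=4.8, sigma=0.8, size=200)

# shapiro returns the W statistic and the p-value
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

print(f'normal sample:       W = {w_norm:.3f}, p = {p_norm:.4f}')
print(f'right-skewed sample: W = {w_skew:.3f}, p = {p_skew:.4f}')
```

With large samples the test can flag even minor deviations, so it is best read together with the Q-Q plot rather than instead of it.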

## 5️⃣ Confidence Interval (95%) for LoanAmount Mean

In [None]:
loan_data = df['LoanAmount']
mean = np.mean(loan_data)
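std_err = stats.sem(loan_data)  # standard error of the mean: sample std / sqrt(n)
# Margin of error: two-sided 95% t critical value with n - 1 degrees of freedom
h = std_err * stats.t.ppf((1 + 0.95) / 2, len(loan_data) - 1)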
print(f"95% CI for mean LoanAmount: ({mean - h:.2f}, {mean + h:.2f})")
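
The interval above is the standard t-based interval, mean ± t × s / √n. The following sketch spells out each term on a small invented sample and compares the result with `stats.t.interval`; the sample values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, invented for illustration
sample = np.array([120.0, 135.0, 150.0, 128.0, 142.0, 160.0, 110.0, 138.0])

n = sample.size
m = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)      # equivalent to stats.sem(sample)
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% critical value

ci_low, ci_high = m - t_crit * se, m + t_crit * se

# Reference interval computed by scipy from the same mean and standard error
ref_low, ref_high = stats.t.interval(0.95, n - 1, loc=m, scale=se)

print(f'Manual: ({ci_low:.2f}, {ci_high:.2f})')
print(f'scipy:  ({ref_low:.2f}, {ref_high:.2f})')
```

Note that when missing values have been filled with the median, the sample standard deviation shrinks slightly, so an interval computed on the imputed column is somewhat narrower than one computed on the observed values alone.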

## 6️⃣ Hypothesis Testing
Test whether the mean LoanAmount differs significantly from 140.

In [None]:
t_stat, p_value = stats.ttest_1samp(df['LoanAmount'], 140)
alpha = 0.05
print(f"T-statistic = {t_stat:.2f}, P-value = {p_value:.4f}")
if p_value < alpha:
    print('Reject H0 → Mean LoanAmount differs significantly from 140')
else:
    print('Fail to reject H0 → No significant difference from 140')
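
The test here is two-sided, with H0: μ = 140 against H1: μ ≠ 140. As a rough sketch of what `stats.ttest_1samp` computes, the statistic and p-value can be derived by hand; the sample below is invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, invented for illustration
sample = np.array([120.0, 135.0, 150.0, 128.0, 142.0, 160.0, 110.0, 138.0])
mu0 = 140  # hypothesized mean under H0

n = sample.size
# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_ref, p_ref = stats.ttest_1samp(sample, mu0)

print(f'Manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'scipy:  t = {t_ref:.4f}, p = {p_ref:.4f}')
```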

## 7️⃣ Linear Regression: Predict LoanAmount using ApplicantIncome & CoapplicantIncome

In [None]:
X = df[['ApplicantIncome', 'CoapplicantIncome']]
y = df['LoanAmount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'MSE: {mse:.2f}\nRMSE: {rmse:.2f}\nR²: {r2:.2f}')
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

plt.scatter(y_test, y_pred)
plt.xlabel('Actual LoanAmount')
plt.ylabel('Predicted LoanAmount')
plt.title('Actual vs Predicted Loan Amount')
plt.show()
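
`LinearRegression` fits ordinary least squares, and R² is defined as 1 − SS_res / SS_tot. A minimal sketch that reproduces both with NumPy alone is shown below. The data are synthetic and the "true" coefficients are arbitrary assumptions; unlike the cell above, the metrics here are computed in-sample rather than on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors and target; coefficients 0.01 and 0.02 are arbitrary
applicant = rng.uniform(1000, 10000, n)
coapplicant = rng.uniform(0, 5000, n)
loan = 50 + 0.01 * applicant + 0.02 * coapplicant + rng.normal(0, 5, n)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n), applicant, coapplicant])

# Least-squares solution of A @ beta ≈ loan
beta, *_ = np.linalg.lstsq(A, loan, rcond=None)
intercept, coef = beta[0], beta[1:]

pred = A @ beta
ss_res = np.sum((loan - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((loan - loan.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((loan - pred) ** 2))

print('Intercept:', intercept)
print('Coefficients:', coef)
print(f'RMSE: {rmse:.2f}, R²: {r2:.3f}')
```

RMSE is expressed in the same units as `LoanAmount`, which usually makes it easier to interpret than MSE.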

## 8️⃣ Multiple Regression (with Categorical Encoding)

In [None]:
df_encoded = pd.get_dummies(df, columns=['Gender','Married','Dependents','Education','Self_Employed','Property_Area'], drop_first=True)

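# Drop the identifier, the loan outcome, and the target itself; leaving LoanAmount
# in X2 would leak the target into the features and produce a meaningless perfect fit.
X2 = df_encoded.drop(columns=['Loan_ID', 'Loan_Status', 'LoanAmount'])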
y2 = df_encoded['LoanAmount']

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=42)

model2 = LinearRegression()
model2.fit(X_train2, y_train2)
y_pred2 = model2.predict(X_test2)

mse2 = mean_squared_error(y_test2, y_pred2)
rmse2 = np.sqrt(mse2)
r2_2 = r2_score(y_test2, y_pred2)

print(f'MSE: {mse2:.2f}\nRMSE: {rmse2:.2f}\nR²: {r2_2:.2f}')

plt.scatter(y_test2, y_pred2)
plt.xlabel('Actual LoanAmount')
plt.ylabel('Predicted LoanAmount')
plt.title('Actual vs Predicted (Multiple Regression)')
plt.show()
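
To illustrate what `pd.get_dummies(..., drop_first=True)` produces, here is a small sketch on an invented frame. With `drop_first=True`, the alphabetically first category (`Rural` here) becomes the baseline and gets no column of its own, which avoids perfectly collinear dummy columns in a model with an intercept. The target column is kept out of the feature matrix.

```python
import pandas as pd

# Hypothetical toy data, not from train.csv
toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban'],
    'ApplicantIncome': [5000, 3000, 4000, 6000],
    'LoanAmount': [130.0, 90.0, 110.0, 150.0],
})

# One-hot encode, dropping the first category as the baseline
encoded = pd.get_dummies(toy, columns=['Property_Area'], drop_first=True)

# Separate features from the target so the target never appears among the predictors
features = encoded.drop(columns=['LoanAmount'])
target = encoded['LoanAmount']

print(list(features.columns))
```

A row with zeros in both dummy columns corresponds to the baseline category, so each dummy coefficient is read as a difference relative to `Rural`.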

✅ *End of The Orange Problem: Complete Analysis*