## Using Classification

# Titanic Dataset - Feature Descriptions

| Feature Name | Description                                                                                 |
|--------------|---------------------------------------------------------------------------------------------|
| **Pclass**   | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd), representing socio-economic status            |
| **Sex**      | Sex of the passenger (`female`/`male` in the raw data; encoded as 0 = female, 1 = male)   |
| **Age**      | Age of the passenger in years                                                              |
| **SibSp**    | Number of siblings and spouses aboard the Titanic                                           |
| **Parch**    | Number of parents and children aboard the Titanic                                          |
| **Fare**     | Passenger fare paid (in British Pounds)                                                    |

**Target Variable:**

| Feature Name | Description                                                                         |
|--------------|-------------------------------------------------------------------------------------|
| **Survived** | Survival status (0 = Did not survive, 1 = Survived)                                |

---

### Notes:
- **Pclass** indicates socio-economic status, with 1st class being the highest.
- **Sex** is stored as a string in the raw data and is mapped to 0/1 before modeling.
- Family-related features (**SibSp** and **Parch**) capture the number of close relatives aboard.
- **Fare** reflects the ticket price, often correlated with class and survival chances.
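
The notes above can be tried out on a handful of rows without downloading the full dataset. The sketch below is only illustrative: the first five rows are the feature values printed in this notebook's output, while the sixth row (with a missing `Age`) and the variable names `toy`, `age_median`, and `fare_by_class` are assumptions added for the example. It applies the same `Sex` mapping and median fill that the notebook uses, then compares the mean fare per class.

```python
import pandas as pd

# First five feature rows as printed in this notebook's output,
# plus one illustrative sixth row (an assumption) with a missing Age.
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3, 3],
        "Sex": ["male", "female", "female", "female", "male", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    }
)

# Encode Sex with the same mapping the notebook uses.
toy["Sex"] = toy["Sex"].map({"female": 0, "male": 1})

# Fill the missing Age with the median of the observed ages.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Mean fare per class: a quick look at how Fare tracks Pclass.
fare_by_class = toy.groupby("Pclass")["Fare"].mean()

print(toy)
print(fare_by_class)
```

With only six rows this says nothing reliable about survival. It is meant to show the mechanics of the encoding and the imputation, and that first-class fares in this tiny sample are far higher than third-class ones.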


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

# Drop rows where target (Survived) is missing
titanic_data = titanic_data.dropna(subset=['Survived'])

# -------------------------------
# 🔹 Save and show first 5 rows before preprocessing
original_rows = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].head()
print("\n🔹 Original Titanic Data (First 5 Rows Before Preprocessing):")
print(original_rows)



# -------------------------------
# 🔹 Preprocess the data

# Select features and target
X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
y = titanic_data['Survived']

# Encode 'Sex' (female = 0, male = 1)
X['Sex'] = X['Sex'].map({'female': 0, 'male': 1})

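# Fill missing 'Age' values with the median.
# Assign the result back instead of using a chained inplace fill, which newer pandas versions warn about.
X['Age'] = X['Age'].fillna(X['Age'].median())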

# -------------------------------
# 🔹 Show the same rows after preprocessing
print("\n🔹 Same Rows After Preprocessing:")
print(X.loc[original_rows.index])

# -------------------------------
# 🔹 Split into train and test sets, then show a subset of the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\n🔹 Subset of Training Data (First 5 Rows):")
print(X_train.head())

print("\n🔹 Subset of Training Labels (First 5 Rows):")
print(y_train.head())

# -------------------------------
# 🔹 Train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"\n✅ Accuracy: {accuracy:.2f}")
print("\n✅ Classification Report:\n", classification_rep)

# -------------------------------
# 🔹 Predict survival for one sample passenger
sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)

sample_dict = sample.iloc[0].to_dict()
print(f"\n🔹 Sample Passenger Features: {sample_dict}")
print(f"🔮 Predicted Survival: {'Survived' if prediction[0] == 1 else 'Did Not Survive'}")



🔹 Original Titanic Data (First 5 Rows Before Preprocessing):
   Pclass     Sex   Age  SibSp  Parch     Fare
0       3    male  22.0      1      0   7.2500
1       1  female  38.0      1      0  71.2833
2       3  female  26.0      0      0   7.9250
3       1  female  35.0      1      0  53.1000
4       3    male  35.0      0      0   8.0500

🔹 Same Rows After Preprocessing:
   Pclass  Sex   Age  SibSp  Parch     Fare
0       3    1  22.0      1      0   7.2500
1       1    0  38.0      1      0  71.2833
2       3    0  26.0      0      0   7.9250
3       1    0  35.0      1      0  53.1000
4       3    1  35.0      0      0   8.0500

🔹 Subset of Training Data (First 5 Rows):
     Pclass  Sex   Age  SibSp  Parch     Fare
331       1    1  45.5      0      0  28.5000
733       2    1  23.0      0      0  13.0000
382       3    1  32.0      0      0   7.9250
704       3    1  26.0      1      0   7.8542
813       3    0   6.0      4      2  31.2750

🔹 Subset of Training Labels (First 5 R

## Using Regression

# California Housing Dataset - Feature Descriptions

| Feature Name | Description                                                                                      |
|--------------|------------------------------------------------------------------------------------------------|
| **MedInc**   | Median income in the block group (in tens of thousands of dollars)                              |
| **HouseAge** | Median age of houses in the block group (in years)                                             |
| **AveRooms** | Average number of rooms per household                                                           |
| **AveBedrms**| Average number of bedrooms per household                                                        |
| **Population** | Total population of the block group                                                           |
| **AveOccup** | Average number of occupants per household                                                       |
| **Latitude** | Geographic latitude of the block group                                                          |
| **Longitude**| Geographic longitude of the block group                                                         |

**Target Variable:**

| Feature Name | Description                                                      |
|--------------|------------------------------------------------------------------|
| **MEDV**     | Median house value of the block group (in units of $100,000); scikit-learn names this target `MedHouseVal`, and the code below stores it as `MEDV` |

---

### Notes:
- Features like **MedInc**, **HouseAge**, and **AveRooms** relate to socioeconomic and housing characteristics.
- **Latitude** and **Longitude** provide geographical location information.
- The target variable **MEDV** is the median house value of each block group, which is what the model tries to predict.
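
Because the target and the income feature are scaled, it can help to convert them back to dollars when reading results. The sketch below is a small worked example under stated assumptions: the `MEDV` and `MedInc` numbers are the first-row values printed in this notebook's output, while the constant names, the helper `rmse_in_dollars`, and the MSE of 0.25 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

MEDV_UNIT = 100_000    # one MEDV unit is $100,000
MEDINC_UNIT = 10_000   # one MedInc unit is $10,000

# MEDV values of the first five rows, as printed in the notebook output.
medv = [4.526, 3.585, 3.521, 3.413, 3.422]
medv_dollars = [round(v * MEDV_UNIT) for v in medv]

# MedInc of the first row (8.3252) expressed in dollars.
medinc_dollars = round(8.3252 * MEDINC_UNIT)


def rmse_in_dollars(mse: float) -> float:
    """Convert an MSE in squared MEDV units to an RMSE in dollars."""
    return math.sqrt(mse) * MEDV_UNIT


print(medv_dollars)
print(medinc_dollars)
# Hypothetical MSE of 0.25, used only for illustration.
print(rmse_in_dollars(0.25))
```

The mean squared error is in squared target units, so taking the square root first gives a typical error size in the same units as `MEDV`, which can then be scaled to dollars.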


import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# ----------------------------------
# 🔹 Load the California Housing Dataset
california_housing = fetch_california_housing()

# Create DataFrame from the data
california_data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target

# ----------------------------------
# 🔹 Show dataset info and features
print("\n📊 Dataset Info:")
print(f"Total rows: {california_data.shape[0]}, Total columns: {california_data.shape[1]}")

print("\n🔹 Feature Names:")
print(list(california_housing.feature_names))

print("\n🔹 First 5 Rows of Feature Data:")
print(california_data.head())

# ----------------------------------
# 🔹 Define features (X) and target (y)
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']

# ----------------------------------
# 🔹 Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("\n🔹 Sample of Training Data (first 5 rows):")
print(X_train.head())

# ----------------------------------
# 🔹 Create and train the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# ----------------------------------
# 🔹 Predict on test set
y_pred = rf_regressor.predict(X_test)

# ----------------------------------
# 🔹 Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# ----------------------------------
# 🔹 Predict a single instance
# Keep the row as a one-row DataFrame so its feature names match those seen during fit
single_data = X_test.iloc[[0]]
predicted_value = rf_regressor.predict(single_data)

print("\n🔎 🔹 Single Test Example:")
print("Input Features:", dict(zip(X.columns, X_test.iloc[0])))
print(f"Predicted Value: {predicted_value[0]:.2f}")
print(f"Actual Value: {y_test.iloc[0]:.2f}")

# ----------------------------------
# 🔹 Show model performance
print(f"\n✅ Mean Squared Error: {mse:.2f}")
print(f"✅ R-squared Score: {r2:.2f}")



📊 Dataset Info:
Total rows: 20640, Total columns: 9

🔹 Feature Names:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

🔹 First 5 Rows of Feature Data:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude   MEDV  
0    -122.23  4.526  
1    -122.22  3.585  
2    -122.24  3.521  
3    -122.25  3.413  
4    -122.25  3.422  

🔹 Sample of Training Data (first 5 rows):
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71   
8267 