# Mini Machine Learning Project

This notebook walks through a small machine learning project from start to finish:
loading data, exploring it, cleaning and preparing it, training models, evaluating them, and drawing conclusions.

Let's get started!

## 1. Load Data

First, we load our dataset. You can substitute any dataset you are interested in.
Here, we create a five-row sample with Titanic-style columns as an example.

In [None]:
import pandas as pd

# Sample dataset: a few Titanic-style features and the 'Survived' target
data = {
    'Age': [22, 38, 26, 35, 28],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Pclass': [3, 1, 3, 1, 3],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Survived': [0, 1, 1, 1, 0]
}
df = pd.DataFrame(data)
df.head()
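
If your data lives in a CSV file rather than a Python dictionary, `pd.read_csv` is the usual entry point. The sketch below is only an illustration: the CSV text is a hypothetical stand-in for a file on disk, and `df_from_csv` is a name chosen here so it does not overwrite `df`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file;
# with a file on disk you would pass its path to pd.read_csv instead.
csv_text = """Age,Sex,Pclass,Fare,Survived
22,male,3,7.25,0
38,female,1,71.2833,1
26,female,3,7.925,1
"""

# read_csv accepts any file-like object, so StringIO works for a quick test
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv.shape)
print(df_from_csv.dtypes)
```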

## 2. Explore Data

Let's understand the structure of our data. We'll check column types, non-null counts, and basic statistics for each feature.

In [None]:
# Check basic info
df.info()

In [None]:
# Summary statistics (describe() covers numeric columns only by default)
df.describe()
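
To look at how individual features are distributed, `value_counts` and `groupby` are often enough for a first pass. This is a minimal sketch on a copy of three of the sample columns; the names `sample`, `sex_counts`, and `survival_by_sex` are introduced here for illustration.

```python
import pandas as pd

# Copy of three columns from the sample dataset, so the cell runs on its own
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Pclass': [3, 1, 3, 1, 3],
    'Survived': [0, 1, 1, 1, 0],
})

# How many rows fall into each category
sex_counts = sample['Sex'].value_counts()

# Mean of a 0/1 target per group is the survival rate for that group
survival_by_sex = sample.groupby('Sex')['Survived'].mean()

print(sex_counts)
print(survival_by_sex)
```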

## 3. Clean Data

Next, we handle any missing values before modeling.
Our sample dataset is complete, but with real data you would typically check for missing values and deal with them.

In [None]:
# Check for missing values
df.isnull().sum()

Since there are no missing values in this small sample, we proceed.
In real datasets, you might fill missing values or drop rows as needed.
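
The two options mentioned above, filling and dropping, could look like the following sketch. The frame `df_missing` is a hypothetical variant of the sample data with two values removed; filling with the column median is just one common choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical version of the sample data with two missing entries
df_missing = pd.DataFrame({
    'Age': [22, np.nan, 26, 35, 28],
    'Fare': [7.25, 71.2833, np.nan, 53.1, 8.05],
})

# Option 1: fill each missing value with the median of its column
filled = df_missing.fillna(df_missing.median())

# Option 2: drop every row that contains at least one missing value
dropped = df_missing.dropna()

print(filled)
print(dropped)
```

Filling keeps all five rows, while dropping leaves three, which matters a lot when the dataset is small.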

## 4. Prepare Data

Before training, we need to encode categorical variables, separate the features from the labels, and split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

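# Encode 'Sex' as integers; classes are sorted, so female=0 and male=1
# (scikit-learn documents LabelEncoder for targets; OrdinalEncoder or OneHotEncoder is the documented choice for features)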
le = LabelEncoder()
df['Sex_encoded'] = le.fit_transform(df['Sex'])

# Define features and target
X = df[['Age', 'Sex_encoded', 'Pclass', 'Fare']]
y = df['Survived']

# Split data into train and test sets (with 5 rows, test_size=0.2 holds out a single row)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
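
As an alternative to integer codes for the categorical step, one-hot encoding avoids implying an order between categories. The sketch below uses `pd.get_dummies` on a copy of two sample columns; `sample` and `encoded` are names introduced here, and `drop_first=True` is an optional choice that keeps a single indicator column for a two-category feature.

```python
import pandas as pd

# Copy of two columns from the sample dataset
sample = pd.DataFrame({
    'Age': [22, 38, 26, 35, 28],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
})

# Replace 'Sex' with indicator columns; drop_first removes the first
# category in sorted order ('female'), leaving only 'Sex_male'
encoded = pd.get_dummies(sample, columns=['Sex'], drop_first=True, dtype=int)

print(encoded)
```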

## 5. Train Models

Let's train two different models: Logistic Regression and Random Forest.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Initialize models (a fixed random_state makes the random forest reproducible)
lr_model = LogisticRegression()
rf_model = RandomForestClassifier(random_state=42)

# Train models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
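
Logistic regression tends to behave better when features share a similar scale, and `Age` and `Fare` here do not. One possible variation, sketched under that assumption, is to chain a `StandardScaler` and the model in a pipeline. `X_small`, `y_small`, and `scaled_lr` are illustrative names and the four rows are copied from the sample data; none of this is part of the training cell above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Four rows copied from the sample data, two numeric features only
X_small = pd.DataFrame({
    'Age': [22, 38, 26, 35],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})
y_small = pd.Series([0, 1, 1, 1])

# The pipeline standardizes features, then fits the classifier on the result
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_small, y_small)

# Predicting on the training rows only demonstrates the API, not model quality
preds = scaled_lr.predict(X_small)
print(preds)
```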

## 6. Evaluate Models

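Now, let's evaluate how well our models perform on the test set. With only five rows in total, the test set contains a single row, so the numbers below demonstrate the workflow rather than real model quality.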

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Predictions
lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

# Accuracy scores
lr_accuracy = accuracy_score(y_test, lr_preds)
rf_accuracy = accuracy_score(y_test, rf_preds)

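# Classification reports (zero_division=0 avoids warnings when a class has no predictions or no true samples)
lr_report = classification_report(y_test, lr_preds, zero_division=0)
rf_report = classification_report(y_test, rf_preds, zero_division=0)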

print("Logistic Regression Accuracy:", lr_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

print("\nLogistic Regression Report:\n", lr_report)
print("\nRandom Forest Report:\n", rf_report)
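
Accuracy alone hides which kinds of mistakes a model makes. A confusion matrix breaks the predictions down by true and predicted class. The labels below are hypothetical, made up for illustration, and are not the output of the models in this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six passengers
y_true = [0, 1, 1, 1, 0, 0]
y_pred = [0, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions: (TP + TN) / total
manual_accuracy = (tp + tn) / (tn + fp + fn + tp)

print(cm)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```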

## 7. Conclusions

Based on the evaluation, compare the two models' accuracy scores and classification reports to decide which one performs better.
Further improvements could include tuning hyperparameters, engineering new features, or trying more advanced algorithms.
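
As one way to approach hyperparameter tuning, the sketch below runs a small grid search over a random forest. Five rows are too few for cross-validation, so it uses synthetic data from `make_classification` as a stand-in; the grid values, `cv=3`, and the names `X_syn`, `y_syn`, and `search` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 200 rows, 4 features, binary target
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Small illustrative grid; real searches usually cover more values
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [2, None],
}

# Each combination is scored with 3-fold cross-validated accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```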

Remember, the goal is to learn and improve, so experiment and enjoy the process!