<a href="https://colab.research.google.com/github/AqueeqAzam/supervised-learning/blob/main/supervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1️⃣ Regression (Predicting a Continuous Value)
--
| Algorithm | Best For | Example Data | Data Type |
|---|---|---|---|
| Linear Regression | Simple relationships | House Price ~ Area | Continuous (Numerical) |
| Multiple Linear Regression | Multiple factors | House Price ~ Area + Location + Bedrooms | Continuous & Categorical |
| Polynomial Regression | Curved relationships | Salary Growth Over Years | Continuous (with Non-Linearity) |
| Support Vector Regression (SVR) | Complex relationships | Stock Market Trends | Continuous (Non-Linear) |
| k-NN Regression | Small datasets, local patterns | Predicting car price based on similar models | Continuous, Categorical |


1️⃣ Simple Linear Regression
📌 Use Case: Price (target) vs. Area (feature)

📌 Best When: ✅ There is a linear relationship between Price and Area. ✅ You need a simple and interpretable model. ✅ Dataset is small to medium-sized.

⚠️ Avoid When: ❌ The relationship between the feature (Area) and the target (Price) is non-linear. ❌ Several independent variables affect the target (use Multiple Linear Regression instead).
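To make the "linear relationship" condition concrete, here is a minimal sketch of what a one-feature least-squares fit computes, written with the standard library only. It reuses the toy Area and Price values from this notebook; the helper names `fit_line` and `r_squared` are illustrative and not part of scikit-learn.

```python
# Ordinary least squares for a single feature, computed by hand.
areas = [1200, 1500, 1300]          # feature (toy values from this notebook)
prices = [500000, 750000, 600000]   # target


def fit_line(x, y):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Share of the variance in y that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(areas, prices)
print(f"Price = {intercept:.2f} + ({slope:.2f} * Area)")
print(f"R-squared: {r_squared(areas, prices, slope, intercept):.4f}")
```

With only three rows the R-squared value is computed on the same data the line was fitted to, so it says little about how the model would do on unseen houses.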

🔧 Industry Applications:

Real Estate Pricing (Zillow, Redfin, Realtor) → Predicting home prices based on Area.

Salary Estimation (HR & Job Market Analytics) → Estimating salary based on years of experience.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score

# Create dataset
data = {
    'House_ID': [1, 2, 3],
    'Type': ['Apartment', 'Villa', 'Apartment'],
    'Price': [500000, 750000, 600000],
    'Construction_Date': ['2020-01-01', '2018-05-15', '2019-07-20'],
    'Is_Active': [np.nan, False, True],
    'Review': ["Great place!", "Not worth it.", "Highly recommended."],
    'Area': [1200, 1500, 1300],
    'Rating': [4, 5, np.nan]
}

df = pd.DataFrame(data)

# Convert categorical feature 'Type' into numerical values
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])

# Impute missing values in 'Rating' with mean
imputer = SimpleImputer(strategy='mean')
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Define feature (X) and target (y); simple linear regression uses a single feature
X = df[['Area']]
y = df['Price']

# Train linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict prices for the training rows
predictions = model.predict(X)

# Calculate R-squared score (on the training data, so it is optimistic)
r2 = r2_score(y, predictions)

# Regression equation coefficients
intercept = model.intercept_
slope = model.coef_[0]

# Display results
df['Predicted_Price'] = predictions
print(f"Regression Equation: Price = {intercept:.2f} + ({slope:.2f} * Area)")
print(f"R-squared Score: {r2:.4f}")

df[['Price', 'Predicted_Price']]


2️⃣ Multiple Linear Regression
📌 Use Case: Price (target) vs. Area, Rating, Type, Is_Active

📌 Best When: ✅ The target variable is influenced by multiple factors. ✅ There is a linear relationship between features and Price.

⚠️ Avoid When: ❌ Features have high multicollinearity (strong correlation among independent variables). ❌ The dataset has non-linear relationships (Polynomial Regression or Gradient Boosting may be better).
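One rough way to spot multicollinearity before fitting is to look at pairwise correlations between features. The sketch below computes a Pearson correlation by hand for the toy Area and encoded Type columns from this notebook; the 0.9 cut-off is an assumed rule of thumb, not a fixed standard.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


area = [1200, 1500, 1300]
house_type = [0, 1, 0]  # label-encoded: Apartment=0, Villa=1

r = pearson(area, house_type)
print(f"corr(Area, Type) = {r:.4f}")

# Assumed threshold: treat |r| above 0.9 as a warning sign.
if abs(r) > 0.9:
    print("Area and Type are strongly correlated; their coefficients may be unstable.")
```

When two features move together this closely, the model cannot reliably separate their individual effects, so the printed coefficients should be read with caution.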

🔧 Industry Applications:

Real Estate (Realtor, Zillow, Airbnb) → Predicting property price based on location, area, and customer rating.

E-commerce (Amazon, Flipkart, Shopify) → Estimating product demand based on multiple factors like price, rating, seasonality, and location.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score

# Create dataset
data = {
    'House_ID': [1, 2, 3],
    'Type': ['Apartment', 'Villa', 'Apartment'],
    'Price': [500000, 750000, 600000],
    'Construction_Date': ['2020-01-01', '2018-05-15', '2019-07-20'],
    'Is_Active': [np.nan, False, True],
    'Review': ["Great place!", "Not worth it.", "Highly recommended."],
    'Area': [1200, 1500, 1300],
    'Rating': [4, 5, np.nan]
}

df = pd.DataFrame(data)

# Convert categorical feature 'Type' into numerical values
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])

# Fill the missing 'Is_Active' value with the most frequent one, then convert booleans to 0/1
imputer_bool = SimpleImputer(strategy='most_frequent')
df['Is_Active'] = imputer_bool.fit_transform(df[['Is_Active']]).ravel().astype(int)

# Impute missing values in 'Rating' with mean
imputer = SimpleImputer(strategy='mean')
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Define features (X) and target (y)
X = df[['Area', 'Rating', 'Type', 'Is_Active']]
y = df['Price']

# Train multiple linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict prices based on the given data
predictions = model.predict(X)

# Calculate R-squared score
r2 = r2_score(y, predictions)

# Regression equation coefficients
intercept = model.intercept_
coefficients = model.coef_

# Display results
df['Predicted_Price'] = predictions
print(f"Regression Equation: Price = {intercept:.2f} + ({coefficients[0]:.2f} * Area) + ({coefficients[1]:.2f} * Rating) + ({coefficients[2]:.2f} * Type) + ({coefficients[3]:.2f} * Is_Active)")
print(f"R-squared Score: {r2:.4f}")

df[['Price', 'Predicted_Price']]


3️⃣ Polynomial Regression
📌 Use Case: Add polynomial terms for Area or Rating

📌 Best When: ✅ The relationship between Price and Area or Rating is non-linear. ✅ You need a more flexible model than Linear Regression.

⚠️ Avoid When: ❌ Overfitting risk is high (especially with high-degree polynomials). ❌ The dataset is small (adding polynomial terms can increase complexity without benefit).
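The small-dataset warning is easy to see by counting columns. The sketch below lists the terms a degree-2 expansion without a bias column produces for the four features used in this notebook; the helper name `degree2_terms` is illustrative.

```python
from itertools import combinations_with_replacement

features = ["Area", "Rating", "Type", "Is_Active"]


def degree2_terms(names):
    """List the linear terms, then every squared and pairwise product term."""
    terms = list(names)
    for a, b in combinations_with_replacement(names, 2):
        terms.append(f"{a}^2" if a == b else f"{a} {b}")
    return terms


terms = degree2_terms(features)
print(len(terms), "columns:")
for term in terms:
    print(" ", term)
```

Four features become 14 columns. With only three rows of data, a linear model on 14 columns can reproduce the training prices exactly, so a perfect R-squared on that data is a sign of overfitting rather than of a good model.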

🔧 Industry Applications:

Stock Market Predictions (Financial Markets) → Predicting stock prices based on historical trends.

Energy Consumption Forecasting (Utility Companies) → Estimating electricity usage based on temperature, time, and past usage patterns.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.metrics import r2_score

# Create dataset
data = {
    'House_ID': [1, 2, 3],
    'Type': ['Apartment', 'Villa', 'Apartment'],
    'Price': [500000, 750000, 600000],
    'Construction_Date': ['2020-01-01', '2018-05-15', '2019-07-20'],
    'Is_Active': [np.nan, False, True],
    'Review': ["Great place!", "Not worth it.", "Highly recommended."],
    'Area': [1200, 1500, 1300],
    'Rating': [4, 5, np.nan]
}

df = pd.DataFrame(data)

# Convert categorical feature 'Type' into numerical values
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])

# Fill the missing 'Is_Active' value with the most frequent one, then convert booleans to 0/1
imputer_bool = SimpleImputer(strategy='most_frequent')
df['Is_Active'] = imputer_bool.fit_transform(df[['Is_Active']]).ravel().astype(int)

# Impute missing values in 'Rating' with mean
imputer = SimpleImputer(strategy='mean')
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Define features (X) and target (y)
X = df[['Area', 'Rating', 'Type', 'Is_Active']]
y = df['Price']

# Apply Polynomial Features (degree=2 for quadratic regression)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Train Polynomial Regression Model
model = LinearRegression()
model.fit(X_poly, y)

# Predict prices based on the given data
predictions = model.predict(X_poly)

# Calculate R-squared score
r2 = r2_score(y, predictions)

# Regression equation coefficients
intercept = model.intercept_
coefficients = model.coef_
feature_names = poly.get_feature_names_out(['Area', 'Rating', 'Type', 'Is_Active'])

# Display results
df['Predicted_Price'] = predictions
print("Regression Equation:")
equation = f"Price = {intercept:.2f}"
for coef, feature in zip(coefficients, feature_names):
    equation += f" + ({coef:.2f} * {feature})"
print(equation)
print(f"R-squared Score: {r2:.4f}")

# Plot actual vs predicted prices
plt.scatter(y, predictions, color='blue')
plt.plot([min(y), max(y)], [min(y), max(y)], linestyle='--', color='red')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Prices")
plt.show()

df[['Price', 'Predicted_Price']]


4️⃣ SVM Regression (Support Vector Regression - SVR)
---

📌 Use Case: Price (target) vs. Area, Rating, Type, Is_Active

📌 Best When: ✅ The dataset is small to medium-sized and contains outliers. ✅ You need a robust model that doesn’t overfit easily.

⚠️ Avoid When: ❌ The dataset is large (SVM can be computationally expensive). ❌ You need a fully interpretable model (SVM is harder to interpret than Linear Regression).
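Part of SVR's tolerance to outliers comes from its epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors grow linearly rather than quadratically. The sketch below compares that loss with squared error; `epsilon=0.1` mirrors scikit-learn's default for `SVR`, and the residual values are made up for illustration.

```python
def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the epsilon tube and linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def squared_error(y_true, y_pred):
    """Loss used by ordinary least squares, shown for comparison."""
    return (y_true - y_pred) ** 2


# Hypothetical residuals: a tiny error, a moderate one, and an outlier.
for residual in (0.05, 0.5, 3.0):
    eps_loss = epsilon_insensitive(0.0, residual)
    sq_loss = squared_error(0.0, residual)
    print(f"residual={residual:<4} epsilon-insensitive={eps_loss:.2f} squared={sq_loss:.2f}")
```

The outlier contributes roughly 2.9 to the epsilon-insensitive loss but 9.0 to the squared loss, which is why a single extreme point pulls a least-squares fit harder.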

🔧 Industry Applications:

House Price Estimation (Zillow, Redfin) → Predicting home value based on multiple features, handling outliers better than Linear Regression.

Healthcare (Medical Diagnosis) → Estimating disease risk based on patient symptoms.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import r2_score

# Create dataset
data = {
    'House_ID': [1, 2, 3],
    'Type': ['Apartment', 'Villa', 'Apartment'],
    'Price': [500000, 750000, 600000],
    'Construction_Date': ['2020-01-01', '2018-05-15', '2019-07-20'],
    'Is_Active': [np.nan, False, True],
    'Review': ["Great place!", "Not worth it.", "Highly recommended."],
    'Area': [1200, 1500, 1300],
    'Rating': [4, 5, np.nan]
}

df = pd.DataFrame(data)

# Convert categorical feature 'Type' into numerical values
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])

# Fill the missing 'Is_Active' value with the most frequent one, then convert booleans to 0/1
imputer_bool = SimpleImputer(strategy='most_frequent')
df['Is_Active'] = imputer_bool.fit_transform(df[['Is_Active']]).ravel().astype(int)

# Impute missing values in 'Rating' with mean
imputer = SimpleImputer(strategy='mean')
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Define features (X) and target (y)
X = df[['Area', 'Rating', 'Type', 'Is_Active']]
y = df['Price']

# Standardize features (important for SVR)
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()

# Train SVR Model
svr_model = SVR(kernel='rbf')  # Using Radial Basis Function kernel
svr_model.fit(X_scaled, y_scaled)

# Predict prices based on the given data
predictions_scaled = svr_model.predict(X_scaled)
predictions = scaler_y.inverse_transform(predictions_scaled.reshape(-1, 1)).flatten()

# Calculate R-squared score
r2 = r2_score(y, predictions)

# Display results
df['Predicted_Price'] = predictions
print(f"R-squared Score: {r2:.4f}")

# Plot actual vs predicted prices
plt.scatter(y, predictions, color='blue')
plt.plot([min(y), max(y)], [min(y), max(y)], linestyle='--', color='red')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Prices (SVR)")
plt.show()

df[['Price', 'Predicted_Price']]


5️⃣ k-NN Regression
---


📌 Use Case: Price (target) vs. Area, Rating, Type, Is_Active

📌 Best When: ✅ The relationship between variables is non-linear. ✅ The dataset is small to medium-sized (k-NN is memory-intensive).

⚠️ Avoid When: ❌ The dataset is large (k-NN is slow as it stores all data points and computes distances). ❌ There are many irrelevant features (k-NN can be affected by noise).
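The reason k-NN slows down on large datasets is visible in a hand-written version: every prediction measures the distance to every stored row. This is a minimal sketch using the toy Area and Price values from this notebook; `knn_predict` is an illustrative helper, not the scikit-learn implementation.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train_X, train_y, query, k=2):
    """Average the targets of the k training rows closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest = [target for _, target in ranked[:k]]
    return sum(nearest) / k


train_X = [(1200,), (1500,), (1300,)]   # Area
train_y = [500000, 750000, 600000]      # Price

# The two closest areas to 1250 are 1200 and 1300, so their prices are averaged.
print(knn_predict(train_X, train_y, (1250,), k=2))
```

Because raw distances are used, a feature with a large numeric range would dominate the ranking, which is why the features are standardized before fitting.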

🔧 Industry Applications:

Recommendation Systems (Netflix, Spotify, Amazon) → Suggesting similar movies or products based on past interactions.

Predicting Housing Prices (Real Estate) → Finding similar properties based on Price, Area, and Rating.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import r2_score

# Create dataset
data = {
    'House_ID': [1, 2, 3],
    'Type': ['Apartment', 'Villa', 'Apartment'],
    'Price': [500000, 750000, 600000],
    'Construction_Date': ['2020-01-01', '2018-05-15', '2019-07-20'],
    'Is_Active': [np.nan, False, True],
    'Review': ["Great place!", "Not worth it.", "Highly recommended."],
    'Area': [1200, 1500, 1300],
    'Rating': [4, 5, np.nan]
}

df = pd.DataFrame(data)

# Convert categorical feature 'Type' into numerical values
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])

# Fill the missing 'Is_Active' value with the most frequent one, then convert booleans to 0/1
imputer_bool = SimpleImputer(strategy='most_frequent')
df['Is_Active'] = imputer_bool.fit_transform(df[['Is_Active']]).ravel().astype(int)

# Impute missing values in 'Rating' with mean
imputer = SimpleImputer(strategy='mean')
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Define features (X) and target (y)
X = df[['Area', 'Rating', 'Type', 'Is_Active']]
y = df['Price']

# Standardize features (important for k-NN)
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()

# Train k-NN Regression Model
knn_model = KNeighborsRegressor(n_neighbors=2)  # Using 2 nearest neighbors
knn_model.fit(X_scaled, y_scaled)

# Predict prices based on the given data
predictions_scaled = knn_model.predict(X_scaled)
predictions = scaler_y.inverse_transform(predictions_scaled.reshape(-1, 1)).flatten()

# Calculate R-squared score
r2 = r2_score(y, predictions)

# Display results
df['Predicted_Price'] = predictions
print(f"R-squared Score: {r2:.4f}")

# Plot actual vs predicted prices
plt.scatter(y, predictions, color='blue')
plt.plot([min(y), max(y)], [min(y), max(y)], linestyle='--', color='red')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Prices (k-NN Regression)")
plt.show()

df[['Price', 'Predicted_Price']]


2️⃣ Classification (Categorizing Data)
--
| Algorithm | Best For | Example Data | Data Type |
|---|---|---|---|
| Logistic Regression | Binary classes | Spam vs. Not Spam | Categorical (0/1) |
| Multiclass Logistic Regression | Multiple categories | Classifying plants into species | Categorical (3+ classes) |
| Decision Tree | Simple, rule-based classification | Approving a loan | Mixed (Numerical & Categorical) |
| Random Forest | More accurate than a single decision tree | Fraud Detection | Mixed (Numerical & Categorical) |
| Support Vector Machine (SVM) | High-dimensional data | Handwriting Recognition | Continuous, Categorical |
| k-NN Classification | Small datasets, local patterns | Disease classification based on symptoms | Mixed (Numerical & Categorical) |

1️⃣ Logistic Regression (Binary Class)
---


📌 Use Case: Is_Active (target) vs. Area, Rating, Type

📌 Best When: ✅ Data is linearly separable (clear boundary exists). ✅ You need probability estimates for predictions. ✅ The dataset is not too large.

⚠️ Avoid When: ❌ Data is non-linear (Logistic Regression struggles with complex decision boundaries). ❌ There are too many irrelevant features.
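The probability estimates mentioned above come from passing a linear score through the sigmoid function. The sketch below shows that step in isolation; the weights, bias, and feature values are hypothetical and were not learned from this notebook's data.

```python
from math import exp


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one (already scaled) sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model parameters and one scaled sample.
weights, bias = [1.5, -0.5], 0.2
p = predict_proba([1.0, 0.4], weights, bias)
label = int(p >= 0.5)  # 0.5 is the usual default decision threshold

print(f"P(active) = {p:.4f}, predicted label = {label}")
```

Since the score is a weighted sum of the features, the boundary where the probability equals 0.5 is a straight line (or a flat plane), which is why the model struggles with curved class boundaries.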

🔧 Industry Applications:

Customer Churn Prediction (Amazon, Netflix, SaaS platforms) → Predicting if a user will stay active or leave.

Credit Risk Assessment (Banking & Finance) → Determining if a person is a high-risk borrower.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Sample Dataset
data = {
    'Type': ['Apartment', 'Villa', 'Apartment', 'Villa', 'Apartment', 'Villa'],
    'Is_Active': [1, 0, 1, 1, 0, 1],  # Target Variable
    'Area': [1200, 1500, 1300, 1600, 1250, 1550],  # Feature
    'Rating': [4, 5, 4.5, 3.5, 4.2, 4]  # Feature
}

df = pd.DataFrame(data)

# Convert categorical 'Type' into numerical (0 = Apartment, 1 = Villa)
df['Type'] = df['Type'].map({'Apartment': 0, 'Villa': 1})

# Define Features (X) and Target (y)
X = df[['Area', 'Rating', 'Type']]
y = df['Is_Active']

# Split Data into Train & Test Sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale Features (Logistic Regression benefits from scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression Model
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

# Predict on Test Data
y_pred = log_model.predict(X_test_scaled)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Plot test points colored by predicted class (this shows predictions, not the decision boundary itself)
plt.scatter(X_test['Area'], X_test['Rating'], c=y_pred, cmap='coolwarm', edgecolors='k')
plt.xlabel("Area")
plt.ylabel("Rating")
plt.title("Logistic Regression Predictions on the Test Set")
plt.show()


2️⃣ Multiclass Logistic Regression
--
📌 Use Case: Type (target) vs. Area, Rating, Is_Active

📌 Best When: ✅ Data has multiple classes but remains linearly separable. ✅ Fast predictions are required.

⚠️ Avoid When: ❌ Data has complex, non-linear relationships. ❌ The dataset contains many interdependent features.
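With more than two classes, the sigmoid is replaced by a softmax that turns one linear score per class into probabilities summing to 1. A minimal sketch follows; the class scores are hypothetical, and the class names simply mirror the property types used in this notebook.

```python
from math import exp


def softmax(scores):
    """Convert one score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["Apartment", "Studio", "Villa"]
scores = [2.0, 0.5, 1.0]  # hypothetical linear scores for one sample

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]

for name, prob in zip(classes, probs):
    print(f"{name}: {prob:.4f}")
print("Predicted:", predicted)
```

The predicted class is the one with the highest score, so prediction costs only one dot product per class, which is what keeps inference fast.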

🔧 Industry Applications:

E-commerce Product Categorization (Amazon, Flipkart) → Classifying products as Luxury, Mid-Range, Budget.

Property Type Prediction (Real Estate) → Classifying a house as Apartment, Villa, Townhouse.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Sample Dataset
data = {
    'Type': ['Apartment', 'Villa', 'Apartment', 'Villa', 'Studio', 'Apartment', 'Studio', 'Villa'],
    'Area': [1200, 1500, 1300, 1600, 1000, 1250, 900, 1550],  # Feature
    'Rating': [4, 5, 4.5, 3.5, 4.2, 4, 3.8, 4.7],  # Feature
    'Is_Active': [1, 0, 1, 1, 0, 1, 0, 1]  # Feature
}

df = pd.DataFrame(data)

# Convert categorical target 'Type' into numerical labels
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])  # (Apartment=0, Studio=1, Villa=2)

# Define Features (X) and Target (y)
X = df[['Area', 'Rating', 'Is_Active']]
y = df['Type']

# Split Data into Train & Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale Features (Logistic Regression benefits from scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Multiclass Logistic Regression Model
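# The lbfgs solver fits a multinomial (softmax) model when the target has 3+ classes;
# the multi_class argument is deprecated in recent scikit-learn releases, so it is omitted
log_model = LogisticRegression(solver='lbfgs')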
log_model.fit(X_train_scaled, y_train)

# Predict on Test Data
y_pred = log_model.predict(X_test_scaled)

# Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Visualize test points colored by predicted class (predictions only, not the decision boundary)
plt.scatter(X_test['Area'], X_test['Rating'], c=y_pred, cmap='coolwarm', edgecolors='k')
plt.xlabel("Area")
plt.ylabel("Rating")
plt.title("Multiclass Logistic Regression Predictions on the Test Set")
plt.show()


3️⃣ Naïve Bayes
--
📌 Use Case: Sentiment (target) vs. Review text (TF-IDF features)

📌 Best When: ✅ The dataset is small. ✅ Features are mostly independent. ✅ You need fast model training and predictions.

⚠️ Avoid When: ❌ Features are highly correlated (e.g., Price and Area are often dependent). ❌ Data is non-linear (Naïve Bayes assumes simple decision boundaries).
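The independence assumption is what makes Naïve Bayes fast: each word contributes its own log-probability, and the contributions are simply added. Below is a minimal multinomial sketch with Laplace smoothing (`alpha=1.0`, the same default scikit-learn's `MultinomialNB` uses). It works on raw word counts rather than TF-IDF weights, and the training texts are lower-cased, punctuation-free versions of four reviews from this notebook.

```python
from collections import Counter
from math import log

train = [
    ("great place", 1),
    ("highly recommended", 1),
    ("not worth it", 0),
    ("terrible experience", 0),
]


def fit(docs):
    """Count words per class and documents per class."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab, alpha=1.0):
    """Return the class with the highest log prior + summed log likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = log(class_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                smoothed = word_counts[label][word] + alpha
                score += log(smoothed / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict("great place to stay", *model))   # 1 = Positive
print(predict("terrible experience", *model))   # 0 = Negative
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero.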

🔧 Industry Applications:

Spam Detection (Facebook, Gmail, Twitter) → Classifying emails or comments as Spam, Promotional, or Normal.

Quick Prototyping for Classification Models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# ✅ 1. Load Previous Dataset
data = {
    'Review': ["Great place!", "Not worth it.", "Highly recommended.", "Terrible experience.", "Loved it!", "Would not recommend."],
    'Sentiment': [1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative
}

df = pd.DataFrame(data)

# ✅ 2. Convert Text into TF-IDF Features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Review'])  # Text data converted to numerical form
y = df['Sentiment']  # Target variable

# ✅ 3. Split Data into Train & Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ✅ 4. Train Naïve Bayes Model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# ✅ 5. Predict on Test Data
y_pred = nb_model.predict(X_test)

# ✅ 6. Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ✅ 7. Test with a New Review
new_review = ["The place was amazing and I loved the experience!"]
new_review_vectorized = vectorizer.transform(new_review)
prediction = nb_model.predict(new_review_vectorized)
print("\nNew Review Sentiment:", "Positive" if prediction[0] == 1 else "Negative")


4️⃣ Decision Tree
--
📌 Use Case: Type (target) vs. Area, Rating, Is_Active

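📌 Best When: ✅ The dataset contains non-linear relationships. ✅ You need interpretability (can visualize decision paths). ✅ There are missing values in the dataset (scikit-learn's DecisionTreeClassifier accepts NaN natively from version 1.3; older versions need imputation first).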

⚠️ Avoid When: ❌ The dataset is small (trees tend to overfit small datasets). ❌ You need high accuracy (Random Forest performs better).
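A tree built with `criterion='gini'` chooses the split that lowers Gini impurity the most. The sketch below evaluates one candidate split on the Type and Area values used in this notebook's decision-tree dataset; the threshold of 1100 is picked by hand for illustration, not found by a search.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted impurity of the two children of 'value <= threshold'."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


types = ["Apartment", "Villa", "Apartment", "Villa", "Studio", "Apartment", "Studio", "Villa"]
areas = [1200, 1500, 1300, 1600, 1000, 1250, 900, 1550]

print(f"Impurity before split: {gini(types):.5f}")
print(f"Impurity after 'Area <= 1100': {split_gini(areas, types, 1100):.5f}")
```

The split isolates both Studio rows in a pure left node, so impurity drops from 0.65625 to 0.375. A tree learner repeats this comparison over candidate thresholds for every feature.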

🔧 Industry Applications:

Loan Approval (Banking, FinTech) → Deciding whether a person gets Loan Approved, Partially Approved, or Denied.

Medical Diagnosis (Healthcare AI) → Classifying patients into Low-Risk, Medium-Risk, High-Risk categories.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# ✅ 1. Load Dataset
data = {
    'Type': ['Apartment', 'Villa', 'Apartment', 'Villa', 'Studio', 'Apartment', 'Studio', 'Villa'],
    'Area': [1200, 1500, 1300, 1600, 1000, 1250, 900, 1550],  # Feature
    'Rating': [4, 5, 4.5, 3.5, 4.2, 4, 3.8, 4.7],  # Feature
    'Is_Active': [1, 0, 1, 1, 0, 1, 0, 1]  # Feature
}

df = pd.DataFrame(data)

# ✅ 2. Convert Target Variable 'Type' into Numerical Labels
label_encoder = LabelEncoder()
df['Type'] = label_encoder.fit_transform(df['Type'])  # (Apartment=0, Studio=1, Villa=2)

# ✅ 3. Define Features (X) and Target (y)
X = df[['Area', 'Rating', 'Is_Active']]
y = df['Type']

# ✅ 4. Split Data into Train & Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ✅ 5. Train Decision Tree Classifier
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_model.fit(X_train, y_train)

# ✅ 6. Predict on Test Data
y_pred = dt_model.predict(X_test)

# ✅ 7. Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ✅ 8. Visualize Decision Tree
plt.figure(figsize=(10, 6))
plot_tree(dt_model, feature_names=['Area', 'Rating', 'Is_Active'], class_names=label_encoder.classes_, filled=True)
plt.show()


5️⃣ k-NN Classification
--
📌 Use Case: Sentiment (target) vs. Review text (TF-IDF features)

📌 Best When: ✅ The dataset is small (k-NN is memory-intensive). ✅ Relationships in data are non-linear. ✅ You want a lazy learner: there is no real training step, the model just stores the data and classifies new points by their nearest neighbors.

⚠️ Avoid When: ❌ The dataset is large (k-NN is slow in predictions). ❌ You have many features (suffers from the "Curse of Dimensionality").
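For classification, the k nearest neighbors vote and the majority label wins. The sketch below uses made-up 2D points and labels purely for illustration; `knn_classify` is a hypothetical helper, not the scikit-learn class.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(train_X, train_y, query, k=3):
    """Return the most common label among the k closest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up points forming two loose clusters.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9)]
train_y = ["A", "A", "A", "B", "B"]

print(knn_classify(train_X, train_y, (1.1, 1.0)))  # all three neighbors are "A"
print(knn_classify(train_X, train_y, (3.0, 3.0)))  # two "B" votes outweigh one "A"
```

Nothing is learned ahead of time: all of the work happens at prediction time, when the query is compared with every stored point.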

🔧 Industry Applications:

Facebook Friend Recommendation → Suggesting new connections based on similar interests, location, interactions.

Handwriting Recognition (OCR Systems) → Classifying handwritten letters/numbers.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# ✅ 1. Load Previous Dataset
data = {
    'Review': ["Great place!", "Not worth it.", "Highly recommended.", "Terrible experience.", "Loved it!", "Would not recommend."],
    'Sentiment': [1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative
}

df = pd.DataFrame(data)

# ✅ 2. Convert Text into TF-IDF Features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Review'])  # Convert text to numerical form
y = df['Sentiment']  # Target variable

# ✅ 3. Split Data into Train & Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ✅ 4. Train k-NN Classifier
knn_model = KNeighborsClassifier(n_neighbors=3)  # k=3 (You can tune this)
knn_model.fit(X_train, y_train)

# ✅ 5. Predict on Test Data
y_pred = knn_model.predict(X_test)

# ✅ 6. Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ✅ 7. Test with a New Review
new_review = ["The place was amazing and I loved the experience!"]
new_review_vectorized = vectorizer.transform(new_review)
prediction = knn_model.predict(new_review_vectorized)
print("\nNew Review Sentiment:", "Positive" if prediction[0] == 1 else "Negative")


# NLP

1️⃣ Count Vectorization (Bag-of-Words Model)
---


📌 Use Case: Convert text into a numerical feature matrix.
📌 Best When:
✅ You need a fast and simple way to represent text numerically.
✅ You are working with structured text data (e.g., emails, customer reviews).

⚠️ Avoid When:
❌ You need semantic understanding (BoW ignores meaning).
❌ The dataset has complex word relationships (use word embeddings instead).
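A quick way to see that Bag-of-Words ignores meaning: two sentences with opposite meanings but the same words get identical vectors. The sketch below uses the standard library only; its tokenization (lower-casing, keeping tokens of two or more word characters) is meant to mirror `CountVectorizer`'s defaults, and the example sentences are made up.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased tokens of 2+ word characters, ignoring word order."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

print(a)
print("Same representation:", a == b)
```

Both sentences map to the same counts, so any model trained on these features cannot tell them apart.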

🔧 Industry Applications:

Spam Detection (Gmail, Yahoo Mail, Outlook) → Representing emails for classification.
News Categorization (Google News, Bloomberg AI) → Grouping articles into Politics, Sports, Tech.
🛠 Scikit-Learn Example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a great product!", "I hate this product!", "This is amazing."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

2️⃣ TF-IDF (Term Frequency - Inverse Document Frequency)
---
📌 Use Case: Weight words based on importance in documents.
📌 Best When:
✅ You need to filter out common words and emphasize important terms.
✅ You are working with text classification tasks (e.g., sentiment analysis, spam detection).

⚠️ Avoid When:
❌ You need deep semantic understanding of text.
❌ The dataset is small (TF-IDF works best on larger datasets).
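The weighting can be reproduced by hand to see how common words are down-weighted. This sketch assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, and applies it to the three-sentence corpus from the Count Vectorization example.

```python
import re
from collections import Counter
from math import log, sqrt

corpus = ["This is a great product!", "I hate this product!", "This is amazing."]


def tokenize(text):
    """Lower-case and keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalised rows."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    idf = {}
    for term in vocab:
        doc_freq = sum(term in doc for doc in counts)
        idf[term] = log((1 + n) / (1 + doc_freq)) + 1
    rows = []
    for doc in counts:
        weights = [doc[term] * idf[term] for term in vocab]
        norm = sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(corpus)
print(vocab)
for row in rows:
    print([round(w, 4) for w in row])
```

In the third document, "amazing" (which appears in one document) ends up with a higher weight than "this" (which appears in all three), which is the filtering effect described above.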

🔧 Industry Applications:

SEO Optimization (Google, Bing, Ahrefs) → Keyword relevance analysis in search engine ranking.
Customer Feedback Analysis (Amazon, Netflix, TripAdvisor) → Identifying important words in reviews.