# Predictions

In this notebook you can fit a model to the data and make predictions. Since we want you to figure things out on your own, we will only introduce some elementary models and techniques. You are encouraged to explore more advanced models and techniques on your own.

Before fitting a supervised model, split your data into training and testing sets so that you can evaluate the model on data it has not seen. You can use the `train_test_split` function from the `sklearn.model_selection` module for this. The clustering models in this notebook are unsupervised and are fitted on the full dataset.
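
As a minimal sketch of such a split, the example below builds a small toy DataFrame (the data and the column names `feature1`, `feature2` and `target` are placeholders, not your real dataset) and holds out 20% of the rows for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; replace it with your own DataFrame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature1": rng.normal(size=100),
    "feature2": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

X = toy[["feature1", "feature2"]]
y = toy["target"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Passing `X` and `y` separately returns four objects, which saves you from selecting the feature columns again after the split.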

The models we will introduce in this notebook are:
- Linear regression (Supervised learning)
- Random forest classifier (Supervised learning)
- K-nearest neighbors classifier (Supervised learning)
- K-means clustering (Unsupervised learning)
- DBSCAN clustering (Unsupervised learning)

Feel free to explore other models and techniques as well. The goal of this notebook is to give you a starting point for making predictions with your data.

You may notice that the steps for fitting each model and making predictions are almost identical. This is because scikit-learn models share the same interface: you create the model, call `fit`, and then call `predict`. What changes is the model class and the parameters you set for it. DBSCAN is a small exception: it has no `predict` method, so you read its cluster assignments from the `labels_` attribute after fitting.
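
The shared workflow can be sketched as a loop over models. This is an illustration on synthetic data from `make_classification`, not part of the notebook's own pipeline; the dictionary keys and the dataset size are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)           # same call for every model
    predictions = model.predict(X_test)   # same call for every model
    scores[name] = accuracy_score(y_test, predictions)
    print(name, scores[name])
```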

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

In [None]:
# Load the processed data
DATA_PATH = "../data/processed/processed_data.csv"
data = pd.read_csv(DATA_PATH)

In [None]:
# Linear regression
# For linear regression, we will use the features "feature1" and "feature2" to predict the target variable "target". You can replace these with the actual feature names from your dataset.
regression_data = data[["feature1", "feature2", "target"]]
X = regression_data[["feature1", "feature2"]]
y = regression_data["target"]

# Split the data into training and testing sets
train, test = train_test_split(regression_data, test_size=0.2, random_state=42)

# Fit the linear regression model and make predictions
model = LinearRegression()
model.fit(train[["feature1", "feature2"]], train["target"])
predictions = model.predict(test[["feature1", "feature2"]])

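# Evaluate the model. accuracy_score only applies to classification, so for regression we use the R^2 score instead (1.0 is a perfect fit).
model.score(test[["feature1", "feature2"]], test["target"])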
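
Regression models are usually evaluated with error-based metrics rather than accuracy, because the predictions are continuous values. The sketch below computes the mean squared error and the R^2 score on synthetic data; the coefficients and the noise level are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus a little noise (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)  # lower is better
r2 = r2_score(y_test, predictions)             # 1.0 is a perfect fit
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```

For a `LinearRegression` model, `model.score(X_test, y_test)` returns the same R^2 value as `r2_score`.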

In [None]:
# Random forest classifier
# For random forest classifier, we will use the features "feature1" and "feature2" to predict the target variable "target". You can replace these with the actual feature names from your dataset.
# To choose all columns except the target column, you can use classification_data.drop("target", axis=1)
classification_data = data[["feature1", "feature2", "target"]]
X = classification_data[["feature1", "feature2"]]
y = classification_data["target"]

# Like before, split the data into training and testing sets
train, test = train_test_split(classification_data, test_size=0.2, random_state=42)

# Fit the random forest classifier and make predictions
model = RandomForestClassifier()
model.fit(train[["feature1", "feature2"]], train["target"])
predictions = model.predict(test[["feature1", "feature2"]])

# Check the accuracy of the model
accuracy_score(test["target"], predictions)
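
The `drop("target", axis=1)` shortcut mentioned in the comments above can be sketched on a small hypothetical DataFrame. The column `feature3` and all values are invented for the example.

```python
import pandas as pd

# Hypothetical data with three feature columns and one target column.
toy = pd.DataFrame({
    "feature1": [0.1, 0.4, 0.35, 0.8],
    "feature2": [1.0, 0.9, 0.2, 0.1],
    "feature3": [5, 3, 8, 1],
    "target": [0, 0, 1, 1],
})

# Keep every column except the target as a feature.
X = toy.drop("target", axis=1)
y = toy["target"]
print(list(X.columns))
```

`drop` returns a new DataFrame, so `toy` itself still contains the `target` column afterwards.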

In [None]:
# K-nearest neighbors classifier
# For K-nearest neighbors classifier, we will use the features "feature1" and "feature2" to predict the target variable "target". You can replace these with the actual feature names from your dataset.
# To choose all columns except the target column, you can use classification_data.drop("target", axis=1)
classification_data = data[["feature1", "feature2", "target"]]
X = classification_data[["feature1", "feature2"]]
y = classification_data["target"]

# Like before, split the data into training and testing sets
train, test = train_test_split(classification_data, test_size=0.2, random_state=42)

# Fit the K-nearest neighbors classifier and make predictions
model = KNeighborsClassifier()
model.fit(train[["feature1", "feature2"]], train["target"])
predictions = model.predict(test[["feature1", "feature2"]])

# Check the accuracy of the model
accuracy_score(test["target"], predictions)
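
`KNeighborsClassifier()` uses 5 neighbors by default, and `n_neighbors` is the main parameter you would adjust. The sketch below tries a few values on synthetic data; the candidate values 1, 5 and 15 are arbitrary picks for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for n_neighbors in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    results[n_neighbors] = accuracy_score(y_test, model.predict(X_test))
    print(n_neighbors, results[n_neighbors])
```

This only shows how the parameter is set. To actually choose a value, it is safer to compare candidates on a separate validation split or with cross-validation than on the test set.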

In [None]:
# K-means clustering
# For K-means clustering, we will use the features "feature1" and "feature2" to cluster the data. You can replace these with the actual feature names from your dataset.
clustering_data = data[["feature1", "feature2"]]

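# Find the optimal number of clusters using the elbow method
k_values = range(1, 10)
inertias = []
for k in k_values:
    # random_state makes the clustering repeatable between runs
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(clustering_data)
    inertias.append(model.inertia_)
# Connect the points with a line so that the elbow is easier to see
plt.plot(k_values, inertias, "bo-")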
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.show()

# Examine the plot and choose the number of clusters at the "elbow": the point where the inertia stops decreasing sharply and the curve starts to flatten. In this example, we will use 3 clusters. You can change this number based on your analysis of the elbow plot.

# Fit the K-means clustering model and make predictions
model = KMeans(n_clusters=3)
model.fit(clustering_data)
predictions = model.predict(clustering_data)

# Evaluate the clustering. Since this is unsupervised learning, there is no target variable to compare against, so accuracy does not apply. Instead, we can visualize the clusters to see if they make sense.
plt.scatter(clustering_data["feature1"], clustering_data["feature2"], c=predictions, cmap="viridis")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-means Clustering")
plt.show()
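
The elbow can also be read from the numbers instead of the plot. The sketch below uses synthetic data from `make_blobs` with three groups (an assumption for the illustration) and prints how much the inertia falls each time one more cluster is added.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (assumed for this sketch).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# inertias[0] belongs to k=1, inertias[1] to k=2, and so on.
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Relative drop in inertia when going from k to k + 1 clusters.
for k in range(1, len(inertias)):
    drop = 1 - inertias[k] / inertias[k - 1]
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.0%}")
```

The drops typically become much smaller once k passes the number of real groups in the data, which is the same elbow the plot shows.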

In [None]:
# DBSCAN clustering
# For DBSCAN clustering, we will use the features "feature1" and "feature2" to cluster the data. You can replace these with the actual feature names from your dataset.
clustering_data = data[["feature1", "feature2"]]

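# Fit the DBSCAN clustering model and read the cluster labels
# eps (the neighborhood radius) and min_samples are example values; tune them to the scale of your data.
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(clustering_data)
# DBSCAN has no predict method; labels_ holds the cluster of each row, and -1 marks noise points.
predictions = model.labels_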

# Evaluate the clustering. Since this is unsupervised learning, there is no target variable to compare against, so accuracy does not apply. Instead, we can visualize the clusters to see if they make sense.
plt.scatter(clustering_data["feature1"], clustering_data["feature2"], c=predictions, cmap="viridis")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("DBSCAN Clustering")
plt.show()
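
DBSCAN assigns the label -1 to points it treats as noise, so it helps to count clusters and noise points separately before interpreting a plot. This is a minimal sketch on synthetic data; the `make_blobs` settings are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data (assumed for this sketch).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# -1 is the noise label, so it is excluded from the cluster count.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

If most points end up as noise, `eps` is probably too small for the scale of the features; if everything falls into one cluster, it is probably too large.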