# Machine Learning: Supervised

## Supervised Learning

![](https://i163.photobucket.com/albums/t281/kyin_album/m1_1.png)

![](https://i163.photobucket.com/albums/t281/kyin_album/m6.png)

# <font color="blue"> Project 1: House Price Prediction

In this example, we'll use a linear regression model to predict housing prices from a single feature: house size.

First, make sure you have scikit-learn installed. You can install it using pip:

In [None]:
pip install scikit-learn

To learn more about scikit-learn, see the [Scikit-learn Documentation](https://scikit-learn.org/stable/).

## Step 1: Prepare the data

In [None]:
# Importing necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample dataset: housing prices (target) based on house size (feature)
house_sizes = np.array([550, 600, 650, 700, 750, 800, 850, 900, 950, 1000])
prices = np.array([300000, 410000, 530000, 510000, 540000, 610000, 730000, 760000, 830000, 860000])

In [None]:
# Plot prices against house_sizes
import numpy as np
import matplotlib.pyplot as plt   # https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html

# Create a scatter plot
plt.scatter(house_sizes, prices, color='blue', label='Data Points')  # any Matplotlib color name works here
plt.xlabel('House Size')
plt.ylabel('Prices')
plt.title('House Prices vs. House Sizes')
plt.legend()
plt.grid(True)
plt.show()

## Step 2 & 3 : Feature Extraction and Split the data (into training and testing set)

In [None]:
# Reshape the feature array to match the input format required by scikit-learn
X = house_sizes.reshape(-1, 1)   # a single column (1); NumPy infers the number of rows (-1)
print(X)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, prices, test_size=0.2, random_state=42)
print(X_test)   # these two arrays will be used to test the model
print(y_test)
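
As a quick sanity check on the reshape and split steps, the sketch below reruns them on the sample data and prints the resulting shapes. It assumes the 80/20 split described in the comment above; with 10 samples that leaves 2 rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

house_sizes = np.array([550, 600, 650, 700, 750, 800, 850, 900, 950, 1000])
prices = np.array([300000, 410000, 530000, 510000, 540000,
                   610000, 730000, 760000, 830000, 860000])

# scikit-learn expects features as a 2D array of shape (n_samples, n_features)
X = house_sizes.reshape(-1, 1)
print(house_sizes.shape, "->", X.shape)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, prices, test_size=0.2, random_state=42
)
print("train:", X_train.shape, "test:", X_test.shape)
```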

## Step 4: Fit model and predict outcomes [Code]

In [None]:
# Create the linear regression model
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Formatting X_test (flatten the 2D column to 1D so each item is a plain number)
formatted_X_test = ["%.2f" % x for x in X_test.ravel()]
print(formatted_X_test)

# Formatting y_pred
formatted_y_pred = ["%.2f" % y for y in y_pred]
print(formatted_y_pred)

In [None]:
# the following is the equation of the linear regression model

# Print the coefficients
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

# Construct the equation string
equation = f"y = {model.intercept_} + {model.coef_[0]} * X"
print("Linear Regression Equation:", equation)
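
To see how the printed equation relates to `model.predict`, here is a minimal sketch that computes a prediction by hand from the intercept and coefficient. For brevity it fits on all ten sample points rather than on the training split, and the house size `x_new` is a hypothetical value chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

house_sizes = np.array([550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]).reshape(-1, 1)
prices = np.array([300000, 410000, 530000, 510000, 540000,
                   610000, 730000, 760000, 830000, 860000])

# Fit on the full sample (an assumption made to keep the sketch short)
model = LinearRegression().fit(house_sizes, prices)

# Prediction by hand: y = intercept + coefficient * x
x_new = 720  # hypothetical house size
by_hand = model.intercept_ + model.coef_[0] * x_new
by_model = model.predict(np.array([[x_new]]))[0]
print(f"by hand: {by_hand:.2f}, model.predict: {by_model:.2f}")
```

Both numbers should agree, since `predict` evaluates the same linear equation.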

## Step 5: Evaluate the model [Code]

In [None]:
from sklearn.metrics import mean_squared_error

# Calculate the Mean Squared Error (MSE) as a measure of the model's performance
mse = mean_squared_error(y_test, y_pred)

# Display the MSE with two decimal places using string formatting
formatted_mse = "%.2f" % mse
print(f"Mean Squared Error: {formatted_mse}")
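
The MSE is the average of the squared differences between actual and predicted values. The sketch below computes it by hand and compares the result with `mean_squared_error`; the two actual/predicted prices are hypothetical numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices
y_true = np.array([410000.0, 830000.0])
y_hat = np.array([400000.0, 850000.0])

# Errors are 10000 and -20000, so the squared errors are 1e8 and 4e8
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# The square root (RMSE) is in the same units as the prices
rmse = np.sqrt(mse_manual)

print(f"manual MSE:  {mse_manual:.2f}")
print(f"sklearn MSE: {mse_sklearn:.2f}")
print(f"RMSE:        {rmse:.2f}")
```

Because the MSE is in squared price units, the RMSE is often easier to interpret: it gives a rough sense of the typical prediction error in the original units.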

## Step 6: Predict unseen data

In [None]:
# Assuming you have a single unseen data point represented as a list
unseen_data = 

# Reshape the unseen_data to a 2D array
unseen_data_reshaped = np.array(unseen_data).reshape(1, -1)

# Predict the target value for the unseen data point
predicted_value = model.predict(unseen_data_reshaped)

# Print the predicted value
print(f"Predicted Value: {predicted_value[0]:.2f}")
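
As an illustration of this step, the sketch below predicts the price of one new house. The size `1100` is a hypothetical value, not part of the exercise, and the model is fitted on the full sample to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

house_sizes = np.array([550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]).reshape(-1, 1)
prices = np.array([300000, 410000, 530000, 510000, 540000,
                   610000, 730000, 760000, 830000, 860000])
model = LinearRegression().fit(house_sizes, prices)

unseen_data = [1100]  # hypothetical house size
# predict() needs a 2D array: 1 row (one sample) and 1 column (one feature)
unseen_2d = np.array(unseen_data).reshape(1, -1)
predicted_value = model.predict(unseen_2d)

print(unseen_2d.shape)
print(f"Predicted Value: {predicted_value[0]:.2f}")
```

Note that 1100 lies outside the range of sizes in the sample, so the result is an extrapolation and should be read with caution.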

# <font color="blue"> Project 2: Spam Prediction

## Step 1: Prepare the data



In [None]:
# make sure the data is labeled
import pandas as pd
data = pd.read_table('',encoding='windows-1252', header=None)
data.columns = ['label', 'text']  # assumes the label column comes first; swap the names if your file differs
print(data.head())
len(data)

In [None]:
# Remove words that contain digits, strip punctuation, and convert to lowercase
import re
import string

# Each lambda is a small one-line function that is applied to every message
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)   # replace any word containing a digit with a space
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
data['text'] = data.text.map(alphanumeric).map(punc_lower)
print(data.head())
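
To make the effect of the two cleaning steps concrete, here is a small sketch that applies the same regular expressions to one message. The sample message and the function names are made up for illustration.

```python
import re
import string

def remove_words_with_digits(text):
    # Any run of word characters that contains at least one digit becomes a space
    return re.sub(r"\w*\d\w*", " ", text)

def strip_punctuation_and_lowercase(text):
    # Lowercase first, then replace every punctuation character with a space
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

sample = "Hello!! Call 0800123 NOW, win a prize2day."  # hypothetical message
cleaned = strip_punctuation_and_lowercase(remove_words_with_digits(sample))
print(cleaned.split())  # ['hello', 'call', 'now', 'win', 'a']
```

The phone number and `prize2day` disappear because they contain digits; everything else is lowercased with punctuation removed.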

## Step 2: Split the data (into training and testing set)

<font color="Blue">__Input__: Features, Predictors, Independent Variables, X's</font>

<font color="orange">__Outputs__: Label, Outcome, Dependent Variable, Y</font>
    
![](https://i163.photobucket.com/albums/t281/kyin_album/m2.png)

In [None]:
# split the data into feature and label
X = data.text # inputs into model
y = data.label # output of model

In [None]:
X.head()

In [None]:
y.head()

## Overfitting

![](https://i163.photobucket.com/albums/t281/kyin_album/m3.png)

![](https://i163.photobucket.com/albums/t281/kyin_album/m4.png)

## Split the data [Code]

In [None]:
# split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# test size = 30% of observations, which means training size = 70% of observations
# random state = 42, so we all get the same random train / test split

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
y_train.head()

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

## Step 3: Numerically encode the input data [Code]

In [None]:
X_test

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')   # drop common English words such as "the" and "and"
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test) # transform reuses the vocabulary learned from the training set and counts each term
# print the dimensions of the training set (text messages, terms)
print(X_train_cv.shape)
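
The sketch below shows what `CountVectorizer` produces on a tiny corpus: one row per message, one column per vocabulary term, and each cell holding how often the term occurs. The two messages are hypothetical, and the vectorizer uses default settings (no stop-word removal) so that every word stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "call me now now"]  # hypothetical messages
toy_cv = CountVectorizer()  # defaults: lowercase, tokens of 2+ characters
train_counts = toy_cv.fit_transform(train_docs)

# List the vocabulary in column order
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)                   # ['call', 'free', 'me', 'now', 'prize']
print(train_counts.toarray())  # rows: [0 1 0 1 1] and [1 0 1 2 0]

# transform() reuses the fitted vocabulary; the unseen word "cash" is ignored
new_counts = toy_cv.transform(["free free cash"])
print(new_counts.toarray())    # [[0 2 0 0 0]]
```

The `2` entries show that the matrix stores term counts, not just presence or absence.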

In [None]:
import joblib
joblib.dump(cv, 'countvectorizer.joblib')
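
Saving the fitted vectorizer lets you reuse the same vocabulary later, for example in another notebook. As a minimal sketch, the code below saves a toy vectorizer to a temporary folder and loads it back with `joblib.load`; the corpus and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message corpus
toy_cv = CountVectorizer().fit(["free prize now", "call me now now"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "countvectorizer.joblib")
    joblib.dump(toy_cv, path)        # write the fitted vectorizer to disk
    restored_cv = joblib.load(path)  # read it back as a new object

same_vocab = restored_cv.vocabulary_ == toy_cv.vocabulary_
print("vocabulary preserved:", same_vocab)
```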

In [None]:
type(X_train_cv)
import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(X_test_cv)

In [None]:
help(cv.fit_transform)

## Step 4: Fit model and predict outcomes [Code]

In [None]:
# Use a logistic regression model (categorical)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train_cv, y_train)   # learn from the vectorized training messages and their labels

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_cv = lr.predict(X_test_cv)
y_pred_cv # The output is all of the predictions/ labels

## Step 5: Evaluate the model

![](https://i163.photobucket.com/albums/t281/kyin_album/m5.png)

## Step 5: Evaluate the model [Code]

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
# import matplotlib.pyplot as plt
# %matplotlib inline
# When running in IPython/Jupyter, %matplotlib inline displays plots inside the notebook and stores them with it
# y_test holds the true labels of the test set; y_pred_cv holds the model's predictions for the same messages
cm = confusion_matrix(y_test, y_pred_cv)
sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
            annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu");
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg),3)
precision = round((true_pos) / (true_pos + false_pos),3)
recall = round((true_pos) / (true_pos + false_neg),3)
f1 = round(2 * (precision * recall) / (precision + recall),3)
print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))
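
The formulas above can be checked against scikit-learn's own metric functions. The sketch below does this on eight hypothetical labels, treating `spam` as the positive class; the label lists are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_hat = ["ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Rows are actual classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
(true_neg, false_pos), (false_neg, true_pos) = cm

accuracy_manual = (true_pos + true_neg) / cm.sum()
precision_manual = true_pos / (true_pos + false_pos)
recall_manual = true_pos / (true_pos + false_neg)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(cm)
print(accuracy_manual, accuracy_score(y_true, y_hat))
print(precision_manual, precision_score(y_true, y_hat, pos_label="spam"))
print(recall_manual, recall_score(y_true, y_hat, pos_label="spam"))
print(f1_manual, f1_score(y_true, y_hat, pos_label="spam"))
```

Here the confusion matrix is `[[4, 1], [1, 2]]`, so accuracy is 6/8 = 0.75 and precision, recall, and F1 are all 2/3.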

## Step 6: Predict new input [Code]

In [None]:
Sentence1 = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st"

# transform expects an iterable of strings, so wrap the sentence in a list
Snew = cv.transform([Sentence1])
result = lr.predict(Snew)

print(result)
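
One detail worth seeing in isolation is that `transform` takes a collection of messages, not a single string. The sketch below trains a toy vectorizer and logistic regression model on four hypothetical messages, vectorizes one new sentence as a one-item list, and shows that a bare string is rejected.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training messages and labels
train_texts = ["win a free prize", "free cash now", "see you at lunch", "are you home yet"]
train_labels = ["spam", "spam", "ham", "ham"]

toy_cv = CountVectorizer()
toy_lr = LogisticRegression().fit(toy_cv.fit_transform(train_texts), train_labels)

sentence = "free prize inside"
row = toy_cv.transform([sentence])  # one-item list -> matrix with one row
prediction = toy_lr.predict(row)
print(row.shape)
print(prediction)

# Passing the string itself raises a ValueError
try:
    toy_cv.transform(sentence)
    bare_string_rejected = False
except ValueError as err:
    bare_string_rejected = True
    print("ValueError:", err)
```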

# <font color="blue"> __Naive Bayes__

## Naive Bayes [Code]

In [None]:
# Use a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
import numpy as np
nb = MultinomialNB()
# Train the model
nb.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_cv_nb = nb.predict(X_test_cv)


## Naive Bayes: Results

In [None]:
cm = confusion_matrix(y_test, y_pred_cv_nb)
sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu");
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg),3)
precision = round((true_pos) / (true_pos + false_pos),3)
recall = round((true_pos) / (true_pos + false_neg),3)
f1 = round(2 * (precision * recall) / (precision + recall),3)
print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))

NBscore = nb.score(X_test_cv, y_test)   # mean accuracy on the test set
print('NB score: {}'.format(round(NBscore, 3)))
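
For a classifier, `score` returns the mean accuracy, so it should match `accuracy_score` applied to the model's predictions. The sketch below checks this on a toy Naive Bayes model; all messages and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training and test messages
train_texts = ["win a free prize", "free cash now", "see you at lunch", "are you home yet"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free prize cash", "lunch at home"]
test_labels = ["spam", "ham"]

toy_cv = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_cv.fit_transform(train_texts), train_labels)
X_toy_test = toy_cv.transform(test_texts)

score_value = toy_nb.score(X_toy_test, test_labels)
accuracy_value = accuracy_score(test_labels, toy_nb.predict(X_toy_test))
print(score_value, accuracy_value)
```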

# <font color="blue"> Project 3: Review Rating Prediction

## Introduction

We will be using the same review data set from Kaggle for this exercise. The product we'll focus on this time is a cappuccino cup.

The following code will help you load in the data.


In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('')
data.head()

### Question 1

- Determine how many reviews there are in total.
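
As a hint, the number of rows in a DataFrame can be read with `len` or from `shape`. The sketch below uses a tiny made-up DataFrame with the same column names as the exercise data (`stars`, `reviews`); the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data
toy_reviews = pd.DataFrame({
    "stars": [5, 4, 1],
    "reviews": ["Great cup", "Good value", "Arrived broken"],
})

n_reviews = len(toy_reviews)            # number of rows
print(n_reviews, toy_reviews.shape[0])  # 3 3
```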


Use the preprocessing code below to clean the reviews data before moving on to modeling.


In [None]:
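# Text preprocessing steps - remove numbers, capital letters and punctuation
import re
import string

alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)   # replace any word containing a digit with a space
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())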

data['reviews'] = data.reviews.map(alphanumeric).map(punc_lower)
data.head()

In [None]:
type(data)

### Question 2: Classification *(20% testing, 80% training)*

Processes for classification

### <font color="Blue">Step 1:</font> Prepare the data (identify the feature and label)

In [None]:
# split the data into feature and label
X = data.reviews # inputs into model
y = data.stars # output of model

X.head()

### <font color="Blue">Step 2:</font> Split the data into training and testing sets

In [None]:
# split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# test size = 20% of observations, which means training size = 80% of observations
# random state = 42, so we all get the same random train / test split

In [None]:
X_test.head()

In [None]:
y_test.head()

### <font color="Blue">Step 3:</font> Vectorize the feature

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
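cv = CountVectorizer(stop_words='english')   # drop common English words such as "the" and "and"
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)   # transform reuses the vocabulary learned from the training set and counts each term
# print the dimensions of the training set (reviews, terms)
print(X_train_cv.shape)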

### <font color="Blue">Step 4:</font> Identify the model/classifier to be used. Feed the training data into the model

### - Linear Regression

In [None]:
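# Use a linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# Train the model
lr.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
# LinearRegression returns continuous values, so round them to the nearest whole star
# (an assumption made here) to let accuracy_score compare them with the star labels
y_pred_LR = np.rint(lr.predict(X_test_cv)).astype(int)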

In [None]:
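# Use a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()   # note: this reuses the name lr, replacing the linear regression model above

# Train the model
lr.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_logreg = lr.predict(X_test_cv)   # variable name chosen here; Step 5 does not use it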

### - SVM

In [None]:
from sklearn.svm import LinearSVC
svc = LinearSVC()

# Train the model
svc.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_svm = svc.predict(X_test_cv)


### - Decision Tree

In [None]:
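from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()   # importing the class directly avoids shadowing the sklearn.tree module

# Train the model
tree.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_dt = tree.predict(X_test_cv)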

### - Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=10, random_state=0)

# Train the model
forest.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_rf = forest.predict(X_test_cv)

### - KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)

# Train the model
neigh.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_knn = neigh.predict(X_test_cv)

### -  Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
import numpy as np
nb = MultinomialNB()

# Train the model
nb.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_nb = nb.predict(X_test_cv)

### <font color="Blue">Step 5:</font> Evaluate the Model - Accuracy Measurement
Generate the accuracy scores for Linear Regression, SVM, Decision Tree, Random Forest, KNN, and Naive Bayes.  

In [None]:
from sklearn.metrics import accuracy_score
# Store the scores under acc_* names so the fitted models (for example nb) are not overwritten
acc_lr = accuracy_score(y_test, y_pred_LR)
acc_svm = accuracy_score(y_test, y_pred_svm)
acc_dt = accuracy_score(y_test, y_pred_dt)
acc_rf = accuracy_score(y_test, y_pred_rf)
acc_knn = accuracy_score(y_test, y_pred_knn)
acc_nb = accuracy_score(y_test, y_pred_nb)

print("Accuracy score for LR: %.4f" % acc_lr)
print("Accuracy score for SVM: %.4f" % acc_svm)
print("Accuracy score for DT: %.4f" % acc_dt)
print("Accuracy score for RF: %.4f" % acc_rf)
print("Accuracy score for KNN: %.4f" % acc_knn)
print("Accuracy score for NB: %.4f" % acc_nb)

__Example Output:__
- Accuracy score for LR  = 0.1651
- Accuracy score for SVM = 0.5413
- Accuracy score for DT  = 0.5505
- Accuracy score for RF  = 0.5872
- Accuracy score for KNN = 0.5963
- Accuracy score for NB  = 0.6514
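
Since every classifier follows the same fit, predict, score pattern, the steps can also be written as a loop over a dictionary of models. The sketch below is one way to do that on a tiny hypothetical review set with two star values; its accuracies say nothing about the real data. Linear Regression is left out because it is a regressor and its outputs would need rounding first.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reviews: six 5-star and six 1-star
good = ["great cup love it", "lovely cup works great", "love this great product",
        "great value love it", "works great lovely design", "love the lovely cup"]
bad = ["broken cup very bad", "bad quality broke fast", "terrible cup arrived broken",
       "very bad terrible value", "broke fast bad design", "terrible quality very bad"]
reviews = good + bad
stars = [5] * 6 + [1] * 6

X_tr, X_te, y_tr, y_te = train_test_split(reviews, stars, test_size=0.25, random_state=42)
toy_cv = CountVectorizer()
X_tr_cv = toy_cv.fit_transform(X_tr)
X_te_cv = toy_cv.transform(X_te)

models = {
    "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(max_depth=10, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "NB": MultinomialNB(),
}

accuracies = {}
for name, clf in models.items():
    clf.fit(X_tr_cv, y_tr)  # train on the vectorized training reviews
    accuracies[name] = accuracy_score(y_te, clf.predict(X_te_cv))
    print(f"Accuracy score for {name}: {accuracies[name]:.4f}")
```

Keeping the scores in a separate dictionary also avoids reusing the model variable names for the accuracy values.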

### Question 3
Predict the star rating of the following review using Linear Regression, SVM, Decision Tree, Random Forest, KNN, and Naive Bayes:

<font color="blue">__"like Cafe Vienna instant coffee products with the convenience of Keurig. All authorized on-line sellers cannot carry them"__</font>

In [None]:
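# Review text from Question 3
S2 = "like Cafe Vienna instant coffee products with the convenience of Keurig. All authorized on-line sellers cannot carry them"

# transform expects an iterable of strings, so wrap the review in a list
Snew = cv.transform([S2])
lry = lr.predict(Snew)
print(lry)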

In [None]:
user_input = input("Enter a sentence: ")

# Vectorize the sentence (as a one-item list) and predict its rating
Snew = cv.transform([user_input])
lry = lr.predict(Snew)
print(lry)