# Lab: 16 - Machine Learning Intro
## text_classifier - Mohammed Al-Hanbali - 31/10/2021

### **NOTE:**
**For practice, I trained a separate model for each data set instead of writing one function that handles every data set, as the tutorial does.**

In [198]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Reading data from sources

In [199]:
filepath = {
    "yelp": "yelp_labelled.txt",
    "amazon": "amazon_cells_labelled.txt",
    "imdb": "imdb_labelled.txt",
}

data_list = []

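for source, path in filepath.items():
    # Each file is tab-separated: one sentence and its 0/1 label per line.
    source_data = pd.read_csv(path, names=["sentences", "label"], sep="\t")
    source_data["source"] = source  # remember which site each row came from
    data_list.append(source_data)

# Stack the three per-source frames into one DataFrame.
review_data = pd.concat(data_list)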
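
To make the expected file layout concrete, here is a minimal sketch that parses two rows from an in-memory string instead of a file. The two sentences and the `"sample"` source tag are made up for illustration; only the `names` and `sep` arguments come from the cell above.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as the data files: sentence, tab, label.
sample = io.StringIO("Loved this place.\t1\nThe crust was not good.\t0\n")

df = pd.read_csv(sample, names=["sentences", "label"], sep="\t")
df["source"] = "sample"  # hypothetical source tag, mirrors the loop above

print(df.shape)
print(df["label"].tolist())
```

Because the files have no header row, `names` supplies the column names; without it, pandas would treat the first review as the header.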

### Separating Data By Source

In [200]:
yelp_data = review_data[review_data["source"] == "yelp"]
amazon_data = review_data[review_data["source"] == "amazon"]
imdb_data = review_data[review_data["source"] == "imdb"]

### Extracting Sentences and Labels

In [201]:
yelp_X = yelp_data["sentences"].values
yelp_y = yelp_data["label"].values

amazon_X = amazon_data["sentences"].values
amazon_y = amazon_data["label"].values

imdb_X = imdb_data["sentences"].values
imdb_y = imdb_data["label"].values

### Splitting Data

In [202]:
yelp_x_train, yelp_x_test, yelp_y_train, yelp_y_test = train_test_split(yelp_X, yelp_y, test_size = 0.3, random_state = 50)

amazon_x_train, amazon_x_test, amazon_y_train, amazon_y_test = train_test_split(amazon_X, amazon_y, test_size = 0.3, random_state = 80)

imdb_x_train, imdb_x_test, imdb_y_train, imdb_y_test = train_test_split(imdb_X, imdb_y, test_size = 0.3, random_state = 45)
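
As a quick check of what `test_size = 0.3` and `random_state` do, the sketch below splits 1,000 placeholder sentences (an assumed size, not taken from the data files) and repeats the split with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 dummy sentences with alternating 0/1 labels.
X = np.array([f"sentence {i}" for i in range(1000)])
y = np.array([i % 2 for i in range(1000)])

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50
)
# Same seed again: the shuffle, and therefore the split, is identical.
x_train_again, x_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=50
)

print(len(x_train), len(x_test))             # 700 300
print(np.array_equal(x_test, x_test_again))  # True
```

Fixing `random_state` keeps the scores reproducible between runs; a different seed gives a different split and usually a slightly different score.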

### Vectorizing Data

In [203]:
# min_df = 1 (the default) keeps every token, which is what min_df = 0 was meant to do;
# newer scikit-learn releases reject an integer 0 for this parameter.
# Each vectorizer is fitted on its own training split only, then applied to both splits.
vectorize_yelp = CountVectorizer(min_df = 1, lowercase=False)
vectorize_yelp.fit(yelp_x_train)

yelp_X_train = vectorize_yelp.transform(yelp_x_train)
yelp_X_test = vectorize_yelp.transform(yelp_x_test)


vectorize_amazon = CountVectorizer(min_df = 1, lowercase=False)
vectorize_amazon.fit(amazon_x_train)

amazon_X_train = vectorize_amazon.transform(amazon_x_train)
amazon_X_test = vectorize_amazon.transform(amazon_x_test)


vectorize_imdb = CountVectorizer(min_df = 1, lowercase=False)
vectorize_imdb.fit(imdb_x_train)

imdb_X_train = vectorize_imdb.transform(imdb_x_train)
imdb_X_test = vectorize_imdb.transform(imdb_x_test)
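
To show what the fitted vectorizer actually produces, here is a small sketch on a made-up three-sentence corpus. The sentences are illustrative only; `lowercase=False` matches the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training sentences, not taken from the data files.
train_sentences = ["Great food", "Slow service", "great service"]

vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(train_sentences)

# With lowercase=False, "Great" and "great" are separate features.
print(sorted(vectorizer.vocabulary_))
# ['Great', 'Slow', 'food', 'great', 'service']

# Words never seen during fit ("pizza") are silently dropped.
X = vectorizer.transform(["Great great pizza"])
print(X.toarray())
# [[1 0 0 1 0]]
```

Each column counts one vocabulary word, so every sentence becomes a row vector whose length equals the vocabulary size learned from the training split.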



### Creating Classification Models

In [204]:
# score() returns the mean accuracy on the held-out test split.
yelp_classifier = LogisticRegression()
yelp_classifier.fit(yelp_X_train, yelp_y_train)
yelp_score = yelp_classifier.score(yelp_X_test, yelp_y_test)

amazon_classifier = LogisticRegression()
amazon_classifier.fit(amazon_X_train, amazon_y_train)
amazon_score = amazon_classifier.score(amazon_X_test, amazon_y_test)

imdb_classifier = LogisticRegression()
imdb_classifier.fit(imdb_X_train, imdb_y_train)
imdb_score = imdb_classifier.score(imdb_X_test, imdb_y_test)

print(f"""
Yelp Score: {yelp_score}
Amazon Score: {amazon_score}
IMDB Score: {imdb_score}
""")


Yelp Score: 0.73
Amazon Score: 0.81
IMDB Score: 0.72
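
The scores above are accuracies: the fraction of test sentences whose label was predicted correctly. The sketch below checks this on a made-up toy data set by comparing `score` with `accuracy_score` from `sklearn.metrics`. The sentences and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up toy data, only to illustrate what score() measures.
train_sentences = ["good food", "great service", "bad food", "terrible service"]
train_labels = [1, 1, 0, 0]
test_sentences = ["good service", "bad service"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_sentences)
X_test = vectorizer.transform(test_sentences)

model = LogisticRegression().fit(X_train, train_labels)

score = model.score(X_test, test_labels)
manual = accuracy_score(test_labels, model.predict(X_test))

print(score == manual)
```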



### Practical Testing

In [205]:
yelp_test_sentence_1 = ["Cold food", "Slow service"]

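# Reuse the vectorizer fitted on the Yelp training data so the columns match the model.
yelp_test_vector_1 = vectorize_yelp.transform(yelp_test_sentence_1)
yelp_classifier.predict(yelp_test_vector_1)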


array([0, 0])
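
`predict` returns only the class label. To see how confident a model is, `predict_proba` can be used instead. The sketch below uses a made-up toy model as a stand-in for `yelp_classifier`; the training sentences are assumptions, not rows from the Yelp file.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data standing in for the Yelp training split.
train_sentences = ["cold food", "slow service", "great meal", "tasty food"]
train_labels = [0, 0, 1, 1]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(
    vectorizer.fit_transform(train_sentences), train_labels
)

new_sentences = ["cold meal", "great service"]
proba = model.predict_proba(vectorizer.transform(new_sentences))

# One row per sentence; columns follow model.classes_, here [0, 1].
print(model.classes_)
print(np.round(proba, 2))
```

Each row sums to 1, and the second column is the estimated probability of a positive review.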

In [206]:
yelp_test_sentence_2 = ["Great meal", "Properly cooked meat"]

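# Reuse the vectorizer fitted on the Yelp training data so the columns match the model.
yelp_test_vector_2 = vectorize_yelp.transform(yelp_test_sentence_2)

yelp_classifier.predict(yelp_test_vector_2)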

array([1, 1])
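
For comparison with the per-source approach used in this notebook, the single-function alternative mentioned in the note at the top could be sketched roughly as follows. The function name `score_source`, its parameters, and the toy data are hypothetical; this is not the tutorial's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def score_source(sentences, labels, random_state=0):
    """Split, vectorize, fit, and return test accuracy for one data set."""
    x_train, x_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.3, random_state=random_state
    )
    vectorizer = CountVectorizer(lowercase=False)
    X_train = vectorizer.fit_transform(x_train)  # learn vocabulary on train only
    X_test = vectorizer.transform(x_test)
    classifier = LogisticRegression().fit(X_train, y_train)
    return classifier.score(X_test, y_test)


# Made-up toy data sets: 10 positive and 10 negative sentences each.
toy_data = {
    "shop": (["good product"] * 10 + ["bad product"] * 10, [1] * 10 + [0] * 10),
    "cinema": (["great movie"] * 10 + ["awful movie"] * 10, [1] * 10 + [0] * 10),
}

scores = {name: score_source(X, y) for name, (X, y) in toy_data.items()}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

The function removes the repetition, at the cost of hiding the intermediate vectorizer and classifier that the practical tests above reuse.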