# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename, header=0)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I decided to use "AI vs Human Text" by SHAYAN GERAMI on Kaggle.
2. The model will predict whether a text was generated by AI or not.
3. This is a supervised learning problem, specifically binary classification.
4. The feature is the text in the `text` column, and the label is whether the text is AI generated (1) or human written (0).
5. Many people use ChatGPT or other AI models for help, but some use them to complete entire assignments, which is a breach of academic policies. The problem goes beyond school: professional fields such as research literature are also noticing the effects of AI-written text.
---- Extra. I initially wanted to use the data set below, but the file was too large and I did not fully understand how to work with the data.
1. I decided to use the dataset "AI-ArtBench" by Ravidu Silva and Jordan J. Bird from Kaggle. I chose this one because it's frequently updated and has 600 downloads.
2. I want to predict, given a piece of art, whether or not it is AI generated.
3. This would be a supervised classification problem. It will be binary classification, though a multi-class version would be interesting in the future (for example, the artist drew the piece with AI help, or traced something).
4. The features are the different art genres, e.g. Impressionism and Renaissance.
5. It has become extremely common for people to use AI for art. The art used to train these AI models often comes from actual artists who do this for a living. Being able to identify which art is AI generated helps ensure no one is being scammed (it's common for AI artists to claim they made the art by hand and charge customers hundreds of dollars). In the future, it could also help identify the artists whose work the AI drew from, so they can be alerted.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
humanvsAItext_filename = os.path.join("data", "AI_Human.csv")
# Read only the first 38,152 rows of the file
get_data = pd.read_csv(humanvsAItext_filename, header=0, nrows=38152)
# Randomly sample 10,000 rows (fixed seed) to get a smaller data set for memory purposes
df = get_data.sample(n=10000, random_state=1234)

In [4]:
print(df.shape)
print(list(df.columns))
df.head(10)

(10000, 2)
['text', 'generated']


Unnamed: 0,text,generated
18105,"A Cowboy Who Rode the Waves\n\nIn the wild, va...",1.0
4909,The use of Facial Action Coding System to read...,0.0
22784,"The Electoral College system, which is the pro...",1.0
22770,The electoral college is a system in the Unite...,1.0
26873,"His adventures sounded exciting, but I was cur...",1.0
2495,car usage should be limited across the world t...,0.0
3936,The new technology called Facial Action Coding...,0.0
9587,\nOnline schooling has become an increasingly ...,1.0
22167,Car-free cities: An Eco-Friendly Future?\n\nIm...,1.0
12342,The study of ocean acidification has become in...,1.0


In [5]:
#0 means human written, 1 means AI generated
ai_generated_count = (df['generated'] == 1.0).sum()
human_generated_count = (df['generated'] == 0.0).sum()
print("AI generated texts:", ai_generated_count)
print("Human written texts :", human_generated_count)

AI generated texts: 5533
Human written texts : 4467
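The counts above suggest only a mild class imbalance. A minimal sketch of the arithmetic follows, using the two printed counts. The helper name `class_balance` is illustrative and not part of the notebook.

```python
def class_balance(counts):
    """Return each class's share of the sample and the majority-class baseline accuracy."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    # A model that always predicts the most common class would reach this accuracy.
    baseline_accuracy = max(shares.values())
    return shares, baseline_accuracy


# Counts printed by the cell above
counts = {"ai": 5533, "human": 4467}
shares, baseline_accuracy = class_balance(counts)
print(shares)
print("Majority-class baseline accuracy: {:.4f}".format(baseline_accuracy))
```

The baseline gives a reference point: a classifier is only useful here if its accuracy is clearly above the share of the larger class.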


In [6]:
df.describe()

Unnamed: 0,generated
count,10000.0
mean,0.5533
std,0.497176
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1. Define the features `X` and the label `y`, then split the data with `train_test_split()`.
2. Transform the text using a TF-IDF vectorizer.
3. Build a neural network with Keras OR use logistic regression (probably the neural network, so it can find more complex patterns in the generated text).
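The logistic regression option in step 3 could serve as a quick baseline to compare against the neural network. Here is a minimal sketch, assuming scikit-learn's `make_pipeline`, `TfidfVectorizer`, and `LogisticRegression`. The toy texts and labels are made up for illustration and do not come from the data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = AI generated, 0 = human written)
toy_texts = [
    "in conclusion the electoral college remains a multifaceted system",
    "in conclusion car free cities offer a multifaceted future",
    "in conclusion online schooling is a multifaceted topic",
    "i think cars are kinda bad cause they smell",
    "i think the face thing is kinda weird honestly",
    "i think school online is kinda boring honestly",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# so the same vocabulary is reused at prediction time.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(toy_texts, toy_labels)

toy_predictions = baseline.predict(toy_texts)
toy_probabilities = baseline.predict_proba(toy_texts)[:, 1]  # P(label == 1)
print(toy_predictions)
```

In the notebook, the same pipeline would be fit on the training split and scored on the held-out test split rather than on the data it was trained on.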

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import time

2024-08-03 12:13:15.056366: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-08-03 12:13:15.056398: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [8]:
y = df['generated'].astype(bool)
X = df['text']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1234)

In [9]:
# Transform the text into TF-IDF vectors (the vectorizer is fit on the training data only)
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
vocab_size = len(tfidf_vectorizer.vocabulary_)
print(vocab_size)

32711
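To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It assumes the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default, and it tokenizes on whitespace, which is simpler than the vectorizer's real tokenizer. The function name `tfidf_rows` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return a sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in tokenized for token in set(doc))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        term_counts = Counter(doc)
        raw = [term_counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows


vocabulary, rows = tfidf_rows(["the cat sat", "the dog sat down"])
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

Each row has one entry per vocabulary word, which is why the vocabulary size printed above also becomes the input width of the neural network.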


In [10]:
nn_model = keras.Sequential()
input_layer = keras.layers.InputLayer(input_shape=(vocab_size,))
nn_model.add(input_layer)

nn_model.add(keras.layers.Dropout(.25))

hidden_layer_1 = keras.layers.Dense(units=64, activation='relu')
nn_model.add(hidden_layer_1)

nn_model.add(keras.layers.Dropout(.25))

hidden_layer_2 = keras.layers.Dense(units=32, activation='relu')
nn_model.add(hidden_layer_2)

nn_model.add(keras.layers.Dropout(.25))

hidden_layer_3 = keras.layers.Dense(units=16, activation='relu')
nn_model.add(hidden_layer_3)


output_layer = keras.layers.Dense(units=1, activation='sigmoid')
nn_model.add(output_layer)

nn_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 32711)             0         
_________________________________________________________________
dense (Dense)                (None, 64)                2093568   
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 1

2024-08-03 12:13:22.471474: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2024-08-03 12:13:22.471505: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2024-08-03 12:13:22.471581: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (i-095f281bde303de91): /proc/driver/nvidia/version does not exist
2024-08-03 12:13:22.471823: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
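The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. A short sketch of that arithmetic follows; the helper name `dense_param_count` is illustrative.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


vocab_size = 32711
# Input width followed by the width of each Dense layer in the model
layer_sizes = [vocab_size, 64, 32, 16, 1]

param_counts = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
print(param_counts)
print("Total trainable parameters:", sum(param_counts))
```

The first three values match the `dense`, `dense_1`, and `dense_2` rows. For the output layer the formula gives 16 × 1 + 1 = 17; the last row of the printed summary appears to be cut off. Almost all parameters sit in the first dense layer because of the wide TF-IDF input.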


In [11]:
sgd_optimizer = keras.optimizers.SGD(learning_rate=0.1)
loss_fn = keras.losses.BinaryCrossentropy(from_logits=False)
nn_model.compile(optimizer=sgd_optimizer, loss=loss_fn, metrics=['accuracy'])
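Binary cross-entropy compares the sigmoid output (a probability) with the true label and penalizes confident wrong answers heavily. A minimal pure-Python sketch of the formula follows. The clipping constant `eps` is an assumption added here to avoid `log(0)`, and the example probabilities are made up.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Two examples: one AI generated (1) and one human written (0)
confident_right = binary_cross_entropy([1, 0], [0.99, 0.01])
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_wrong = binary_cross_entropy([1, 0], [0.01, 0.99])
print(confident_right, unsure, confident_wrong)
```

Setting `from_logits=False` in the cell above tells Keras that the model already outputs probabilities, which matches the sigmoid activation on the output layer.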

In [12]:
#Referenced from previous assignments
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    def __init__(self, num_epochs: int, every_n: int = 50):
        self.num_epochs = num_epochs
        self.every_n = every_n
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v) for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))

In [13]:
start_time = time.time()
num_epochs = 35
history = nn_model.fit(X_train_tfidf.toarray(), y_train, epochs=num_epochs, verbose=0, validation_split=0.2, callbacks=[ProgBarLoggerNEpochs(num_epochs, every_n=5)])
end_time = time.time()
print('Elapsed time: %.2fs' % (end_time-start_time))

2024-08-03 12:13:24.942459: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2024-08-03 12:13:24.947188: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2649995000 Hz


Epoch [5/ 35], Loss: 0.0702, Accuracy: 0.9763, Val_loss: 0.0554, Val_accuracy: 0.9833
Epoch [10/ 35], Loss: 0.0296, Accuracy: 0.9905, Val_loss: 0.0440, Val_accuracy: 0.9853
Epoch [15/ 35], Loss: 0.0247, Accuracy: 0.9928, Val_loss: 0.0390, Val_accuracy: 0.9853
Epoch [20/ 35], Loss: 0.0103, Accuracy: 0.9972, Val_loss: 0.0391, Val_accuracy: 0.9867
Epoch [25/ 35], Loss: 0.0633, Accuracy: 0.9845, Val_loss: 0.0424, Val_accuracy: 0.9853
Epoch [30/ 35], Loss: 0.0090, Accuracy: 0.9970, Val_loss: 0.0474, Val_accuracy: 0.9867
Epoch [35/ 35], Loss: 0.0066, Accuracy: 0.9975, Val_loss: 0.0407, Val_accuracy: 0.9873
Elapsed time: 35.98s


Initial version of NN output (num_epochs = 55):\
Epoch [5/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0664, Val_accuracy: 0.9867 \
Epoch [10/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0669, Val_accuracy: 0.9867\
Epoch [15/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0675, Val_accuracy: 0.9867\
Epoch [20/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0680, Val_accuracy: 0.9867\
Epoch [25/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0684, Val_accuracy: 0.9867\
Epoch [30/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0689, Val_accuracy: 0.9867\
Epoch [35/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0692, Val_accuracy: 0.9867\
Epoch [40/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0697, Val_accuracy: 0.9867\
Epoch [45/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0700, Val_accuracy: 0.9867\
Epoch [50/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0703, Val_accuracy: 0.9867\
Epoch [55/ 55], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0706, Val_accuracy: 0.9867\
Elapsed time: 32.04s\
The output shows a low val_loss and high accuracy, which is a good sign. However, a training loss of 0.0000 might imply overfitting.

Second version (num_epochs = 35):\
Epoch [5/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0710, Val_accuracy: 0.9867\
Epoch [10/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0712, Val_accuracy: 0.9867\
Epoch [15/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0715, Val_accuracy: 0.9867\
Epoch [20/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0718, Val_accuracy: 0.9867\
Epoch [25/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0721, Val_accuracy: 0.9867\
Epoch [30/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0723, Val_accuracy: 0.9867\
Epoch [35/ 35], Loss: 0.0000, Accuracy: 1.0000, Val_loss: 0.0725, Val_accuracy: 0.9867\
Elapsed time: 21.83s\
The val_loss is slightly higher, but val_accuracy is unchanged. Next, I will add dropout layers between all hidden layers.

Third version (added dropout layers, removed the 4th hidden layer):\
Epoch [5/ 35], Loss: 0.0749, Accuracy: 0.9798, Val_loss: 0.0821, Val_accuracy: 0.9727\
Epoch [10/ 35], Loss: 0.0252, Accuracy: 0.9938, Val_loss: 0.0775, Val_accuracy: 0.9760\
Epoch [15/ 35], Loss: 0.0285, Accuracy: 0.9930, Val_loss: 0.0514, Val_accuracy: 0.9867\
Epoch [20/ 35], Loss: 0.0246, Accuracy: 0.9930, Val_loss: 0.0692, Val_accuracy: 0.9807\
Epoch [25/ 35], Loss: 0.1042, Accuracy: 0.9693, Val_loss: 0.0584, Val_accuracy: 0.9813\
Epoch [30/ 35], Loss: 0.0096, Accuracy: 0.9977, Val_loss: 0.0663, Val_accuracy: 0.9840\
Epoch [35/ 35], Loss: 0.0063, Accuracy: 0.9985, Val_loss: 0.0730, Val_accuracy: 0.9853\
Elapsed time: 23.13s \
This version makes more sense to me for the data. Having no training loss at all was concerning. Next, I will move one of the dropout layers to directly after the input.
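One way to compare runs is to look for the epoch with the lowest validation loss instead of the last epoch. The sketch below uses the values logged for the final 35-epoch run with dropout after the input. Only every fifth epoch was logged, so the true minimum may fall between the logged epochs.

```python
# Val_loss values copied from the logged output of the final run
logged_val_loss = {
    5: 0.0554,
    10: 0.0440,
    15: 0.0390,
    20: 0.0391,
    25: 0.0424,
    30: 0.0474,
    35: 0.0407,
}

# Epoch whose logged validation loss is lowest
best_epoch = min(logged_val_loss, key=logged_val_loss.get)
best_val_loss = logged_val_loss[best_epoch]
print("Lowest logged val_loss: {:.4f} at epoch {}".format(best_val_loss, best_epoch))
```

Since the validation loss stops improving well before the last epoch, a possible next step would be Keras's `EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`), which stops training once the monitored value stops improving. That step was not part of the runs recorded here.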

In [14]:
loss, accuracy = nn_model.evaluate(X_test_tfidf.toarray(), y_test)
print('Loss: ', str(loss) , 'Accuracy: ', str(accuracy))

Loss:  0.03994454815983772 Accuracy:  0.9887999892234802
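Accuracy alone does not show which kind of mistake the model makes. Precision and recall for the AI-generated class separate false alarms from misses. A minimal pure-Python sketch follows; the function name `precision_recall_f1` and the toy labels and probabilities are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (AI generated) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # AI text flagged as AI
    fp = sum(1 for t, p in pairs if not t and p)    # human text flagged as AI
    fn = sum(1 for t, p in pairs if t and not p)    # AI text missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels and sigmoid outputs
toy_true = [True, True, True, False, False, False, True, False]
toy_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]
toy_pred = [p >= 0.5 for p in toy_prob]  # same 0.5 threshold as the notebook

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

For this problem a false positive means a human author is wrongly flagged as using AI, so precision may matter more than raw accuracy. In the notebook, the inputs would be `y_test` and the thresholded model predictions on the test set.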


In [15]:
predictions = nn_model.predict(X_test_tfidf.toarray())

In [17]:
print(X_test.to_numpy()[11])
# predictions has shape (n_samples, 1); threshold the sigmoid output at 0.5
isAI = bool(predictions[11][0] >= 0.5)
print('\nPrediction: Is this AI generated? {}\n'.format(isAI))
print('Actual: Is this AI generated? {}\n'.format(y_test.to_numpy()[11]))

Technology is becoming a bigger part of our lives. As time goes on, we develop new technology for different purposes. One such purpose is to read emotions. This technology will benefit our society in different ways, like in education. The technology to read students emotions is valuable to our schools.

The first way in which this technology is vaulable to schools is by improving lessons. Often times, many students of different calibers are thrust into the same classroom. While some grow bored others become confused by the content of the lesson. In the article, Dr. Huang is quoted saying, "A classroom computer could recognize when a student is becoming confused or bored," and then "Then it could modify the lesson, like an effective human instructor. If we implemented this emotion recognition technology, we could help students of all levels grow and reach their potential.

Another reason why technology that can recognize emotions is vaulable in schools is because of fighting and bullyi