
# Cars and cameras opinion model



### Two models are presented to determine if an opinion is about a camera or a car


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn import svm


## Importing the data


The data used is obtained from https://www.kaggle.com/jyotiprasadpal/eopinionscom-product-reviews

In [2]:
#Load the CSV file into a pandas DataFrame
df = pd.read_csv('cars-and-cameras-opinions.csv')


## Analyzing and optimizing the data


In [3]:
df
#df.info

| | class | text |
|---|---|---|
| 0 | Auto | I have recently purchased a J30T with moderat... |
| 1 | Camera | I bought this product because I need instant ... |
| 2 | Auto | I have owned my Buick since 53000 km and I am... |
| 3 | Camera | This was my first Digital camera so I did qui... |
| 4 | Camera | Minolta DiMAGE 7Hi is in a digital SLR with 5... |
| ... | ... | ... |
| 595 | Auto | Recently our 12 year old Nissan Stanza decide... |
| 596 | Camera | I always do a lot of research before I buy an... |
| 597 | Auto | This car is an all around good buy If you ar... |
| 598 | Auto | I waited to write this until I have had 4 mon... |



### Checking if we have null data in the dataframe


In [4]:
bol_null = df.isnull() #True if the value is null, False if the value is not null
check_null = bol_null.any(axis=1) #Check if there is at least one null in an instance
return_null = df[check_null] #Returns the rows that have null values
print(return_null)

Empty DataFrame
Columns: [class, text]
Index: []



***An empty frame is returned, so the data has no missing values***


***We can also ask if there are null values in the following way***

In [5]:
df.isnull().values.any()

False

In [6]:
#Total of null values in each column
df.isnull().sum()

class    0
text     0
dtype: int64

In [7]:
#Total sum of null values
df.isnull().sum().sum()

0

### Studying the data

In [8]:
df.loc[0]

class                                                 Auto
text      I have recently purchased a J30T with moderat...
Name: 0, dtype: object

In [9]:
pd.set_option("display.max_colwidth", None) #If we want to read a complete opinion
df.loc[0]

class                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   


***We are also interested in how many words each opinion has***

***The model is likely to work better if it is trained on longer opinions***


In [10]:
#Counting spaces only approximates the word count: consecutive spaces are each counted, so the result is too high
#df['text'].str.count(' ')  

In [11]:
#To be more precise, we will use the next command to count how many words are in each opinion
df['text'].str.split().str.len()

0      221
1      536
2      278
3      202
4      974
      ... 
595    408
596    341
597    158
598    461
599    237
Name: text, Length: 600, dtype: int64
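The difference between the two counting approaches can be seen on a single string. This is a minimal sketch in plain Python rather than pandas. The `sample` string is made up for illustration, and the sketch assumes that `.str.count(' ')` and `.str.split().str.len()` behave like their built-in counterparts on each row.

```python
# Hypothetical opinion fragment with runs of two and three spaces
sample = "great  camera   overall"

space_count = sample.count(" ")    # counts every single space character
word_count = len(sample.split())   # split() with no argument collapses runs of whitespace

print(space_count)  # 5
print(word_count)   # 3
```

Because the opinions in this dataset contain many double spaces where punctuation was removed, the space-based count would overestimate their length.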

In [12]:
#If we want to know what is the mean number of words used in an opinion, we run the next command
df['text'].str.split().str.len().mean()

529.8733333333333

In [13]:
#A more general description of the data
df['text'].str.split().str.len().describe()

count     600.000000
mean      529.873333
std       511.702057
min        80.000000
25%       237.000000
50%       372.000000
75%       620.500000
max      4982.000000
Name: text, dtype: float64


***This means we are training the model on long descriptions***

***The model is likely to perform well on long descriptions, but it may not on short ones***


In [14]:
#We now want to know how many opinions we have of each, cameras or cars
df['class'].value_counts()

Camera    350
Auto      250
Name: class, dtype: int64

***The classes are moderately imbalanced (about 58% Camera and 42% Auto), which is an acceptable proportion for learning both classes***


## Model 1 - Decision Tree Classifier: Creating, training and implementing the model


***We will implement classification models, as we want to classify between cameras and cars***

In [15]:
#We have 600 opinions. We will use 70% for training and 30% for testing
#The rows are shuffled (the default) so the split is not biased by the order of the opinions; random_state makes the split reproducible

train, test = train_test_split(df, test_size = 0.3, random_state = 128)


### Verifying the split requests


In [16]:
pd.reset_option("display.max_colwidth") #To reset the width of the column
train.head(2)

| | class | text |
|---|---|---|
| 371 | Camera | About a year ago I purchased a 775 from Dell... |
| 421 | Auto | The body of this car hasn t fallen apart whil... |


In [17]:
test.head(2)

| | class | text |
|---|---|---|
| 144 | Auto | Where to begin I bought this car on the recom... |
| 436 | Camera | I purchased this digital camera because I am... |


***We confirm the data is shuffled***

In [18]:
len(train) #420 is 70% of 600

420

In [19]:
len(test) #180 is 30% of 600

180

In [20]:
train['class'].value_counts() #The sum of camera + auto counts gives 420

Camera    246
Auto      174
Name: class, dtype: int64

In [21]:
test['class'].value_counts() #The sum of camera + auto counts gives 180

Camera    104
Auto       76
Name: class, dtype: int64


### Dividing the data and storing it in training and testing variables


In [22]:
X_train = train['text'].to_list()
y_train = train['class'].to_list()
X_test = test['text'].to_list()
y_test = test['class'].to_list()


### CountVectorizer from sklearn


***To classify, we will use a bag-of-words representation, which describes each opinion by the number of times each word appears in it***

In [23]:
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train) #Learn the vocabulary dictionary and return document-term matrix
X_test_vectors = vectorizer.transform(X_test) #Transform documents to document-term matrix
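To make the fit/transform distinction concrete, here is a simplified plain-Python sketch of a bag of words. It is not scikit-learn's implementation: it only imitates the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and returns dense lists instead of a sparse matrix. The helper names `tokenize`, `fit_vocabulary` and `transform`, and the toy sentences, are assumptions made for illustration.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Learn the vocabulary: map each distinct token to a column index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Build one row of word counts per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_train = ["I like my camera", "My car is not working"]
vocabulary = fit_vocabulary(toy_train)
train_matrix = transform(toy_train, vocabulary)
test_matrix = transform(["My new camera is working"], vocabulary)

print(list(vocabulary))  # ['camera', 'car', 'is', 'like', 'my', 'not', 'working']
print(train_matrix)      # [[1, 0, 0, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1, 1]]
print(test_matrix)       # [[1, 0, 1, 0, 1, 0, 1]]
```

Note that "new" is dropped from the test sentence because it was not in the training vocabulary. This is why the test data is only transformed, never fitted, and why very short opinions give the classifier little to work with.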

In [24]:
#This tells us how many different words we have from the train data
X_train_vectors.shape

(420, 11316)

In [25]:
#An example of what we have in the first opinion
print(X_train_vectors[0])

  (0, 900)	1
  (0, 11261)	1
  (0, 1112)	1
  (0, 7973)	1
  (0, 777)	2
  (0, 4571)	2
  (0, 3212)	1
  (0, 847)	1
  (0, 7507)	6
  (0, 5898)	1
  (0, 4968)	1
  (0, 11148)	6
  (0, 10152)	18
  (0, 6614)	1
  (0, 9507)	1
  (0, 7050)	1
  (0, 10189)	6
  (0, 2114)	9
  (0, 466)	1
  (0, 1090)	1
  (0, 7790)	1
  (0, 6342)	1
  (0, 1750)	1
  (0, 1049)	1
  (0, 10291)	9
  :	:
  (0, 2110)	1
  (0, 4714)	1
  (0, 10731)	1
  (0, 6257)	1
  (0, 2656)	1
  (0, 5770)	1
  (0, 1120)	1
  (0, 10013)	2
  (0, 5299)	2
  (0, 1165)	1
  (0, 10018)	1
  (0, 2735)	1
  (0, 7623)	1
  (0, 9083)	1
  (0, 10159)	1
  (0, 7103)	1
  (0, 7856)	1
  (0, 7465)	1
  (0, 4319)	1
  (0, 10728)	1
  (0, 7660)	1
  (0, 6817)	1
  (0, 11201)	1
  (0, 10014)	1
  (0, 10600)	1



### Implementing the Decision Tree Classifier algorithm from sklearn



***A decision tree will let us classify each opinion***


In [26]:
#Defining our classifier using the decision tree
clf_dec = DecisionTreeClassifier()

#We will use the bag-of-words vectors obtained from the training opinions
clf_dec.fit(X_train_vectors, y_train) #Build a decision tree classifier from the training set (X,y)

DecisionTreeClassifier()


### Testing the model


In [27]:
#We will observe the first test value, which seems to be a car
X_test[0]

' Where to begin I bought this car on the recommendation of a friend who has a pretty serious love affair with Volvo cars  For the first 5   6 months  the car was fine  But then   Paint Chips  Within the first 10 months  I noticed dime sized paint chips on the front of the car  Unfortunately  I didn t purchase the clear coat auto bra at the time of original purchase  To me  this was totally unacceptable  I took it to my dealer and had to argue with the service director to get it fixed  paint is not warranted  In fact  they wouldn t fix it at first  It was only when I responded to the Volvo service survey with my dissatisfaction that they agreed to fix the paint chips  During my original conversations with Curtis at Rickenbaugh Volvo  he indicated that the problem was likely my fault  asking if I follow other cars too closely Headlights  The headlights go out about every 6 months  Annoying Brakes  The car was in for work on the brakes twice within 34K miles  The power assist on the brak

In [28]:
#We predict if this last opinion is a camera or an auto with the model we have made
clf_dec.predict(X_test_vectors[0])

array(['Auto'], dtype='<U6')

***The model correctly predicts that the opinion is about a car***


## Accuracy of the model


We will evaluate the performance of the model in three different ways


### Method 1: Score


In [29]:
#We are interested in the accuracy of the model, so first we will use the score method
clf_dec.score(X_test_vectors, y_test) #This will return the mean accuracy on the given test data and labels

0.9944444444444445


***The score is very high, which suggests that the training data separates the two classes well***



### Method 2: F1 score 


In [30]:
#F1 score can be interpreted as a harmonic mean of the precision and recall
#Best score is 1 and worst is 0
f1_score(y_test, clf_dec.predict(X_test_vectors), average = None)

array([0.99346405, 0.99516908])


***The score is around 99% for each class***



### Method 3: Confusion matrix


In [31]:
#We can use the confusion matrix also
confusion_matrix(y_test, clf_dec.predict(X_test_vectors))

array([[ 76,   0],
       [  1, 103]], dtype=int64)


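***Rows are the true classes and columns the predicted classes, in the order Auto, Camera. Taking Auto as the positive class: 76 true positives, 103 true negatives and 1 false positive (a Camera opinion predicted as Auto)***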
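The three evaluation methods are consistent with each other, which can be checked by recomputing the accuracy and the per-class F1 scores directly from the confusion matrix. The sketch below uses plain Python and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order. The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above
# Rows: true labels, columns: predicted labels, order ('Auto', 'Camera')
matrix = [[76, 0],
          [1, 103]]
labels = ["Auto", "Camera"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

f1_per_class = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]                          # correctly predicted as this class
    fn = sum(matrix[i]) - tp                   # this class predicted as the other
    fp = sum(row[i] for row in matrix) - tp    # the other class predicted as this one
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_per_class[label] = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))                                 # 0.9944
print({k: round(v, 4) for k, v in f1_per_class.items()})  # {'Auto': 0.9935, 'Camera': 0.9952}
```

These values match the outputs of `score` and `f1_score` shown earlier.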



## Testing the model with new opinions


In [32]:
#We can use a new list of opinions as a test
#The last four opinions were obtained from the internet
#The model should predict: Camera, Auto, Auto, Auto, Camera, Auto

extra_test = ['I like my camera', 'My car is not working', 
               "The Nissan Armada isnt the newest large SUV, and its not very fuel efficient or steady on the highway, but its one of the more elegant and well equipped.",
               "I will never purchase another Chevy vehicle ever again. Purchased an equinox a few years back probably a month later started having engine issues within a few months had to replace the engine all together then a few months later the brand new engine started to deteriorate and break down as well. Turns out Chevy is quite aware of the terrible issue they have with equinox and their engines however they refuse to do anything about it. Then shortly after my sister bought a Chevy truck almost brand new and shortly after that she had to replace the entire transmission for the vehicle. The Chevy brand used to be a reliable brand and a trustworthy vehicle for your family. Now there are no family values they don't care and you're on your own to deal with issues the issues.",
               "This was purchased for my daughter and shes 8. She likes it and takes pics and videos. Battery last long and comes with SD card. Picture quality is good and so is video",
               "Excellent acquisition, I was pleasantly surprised, the structure and suspension are dynamic, the color is what I expected and the interiors are very comfortable, my suggestion is that the audio system be a little more innovative"]
new_test = vectorizer.transform(extra_test)
clf_dec.predict(new_test)

array(['Camera', 'Auto', 'Auto', 'Auto', 'Auto', 'Auto'], dtype='<U6')


***The model fails on the fifth opinion (a camera predicted as Auto), resulting in 5/6 predictions correct***



## Model 2: SVM



***We will use a Support Vector Machine (SVM), a supervised ML algorithm***

***We will use C-Support Vector Classification with a linear kernel***


In [33]:
#We will use the new classifier using SVC
clf_svm = svm.SVC(kernel = 'linear')
clf_svm.fit(X_train_vectors, y_train)

SVC(kernel='linear')


### Accuracy of the SVM model: score


In [34]:
#We use score again to test the accuracy of the new model
clf_svm.score(X_test_vectors, y_test)

1.0


***The accuracy on the test set is 100%, which again suggests that the two classes are easy to separate in this data***



### Testing


In [35]:
clf_svm.predict(new_test) 

array(['Auto', 'Auto', 'Auto', 'Auto', 'Auto', 'Auto'], dtype='<U6')


***Even though the SVM model is more accurate on the test data, it misclassifies two of the six new_test opinions***

***Both errors are camera opinions predicted as Auto, so the model was unable to detect any of the camera opinions***



## Corrections to the model



***We can try to improve the results by changing how the data is split***


In [36]:
train, test = train_test_split(df, test_size = 0.33, random_state = 42)
X_train = train['text'].to_list()
y_train = train['class'].to_list()
X_test = test['text'].to_list()
y_test = test['class'].to_list()


***Now we use 67% of the data to train and 33% to test***

***We also change the random state***



### Decision tree classifier model with new split


In [37]:
#Count vectorizer
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test) 

#Decision tree classifier
clf_dec = DecisionTreeClassifier()
clf_dec.fit(X_train_vectors, y_train)

clf_dec.score(X_test_vectors, y_test) 

0.9949494949494949

In [38]:
extra_test = ['I like my camera', 'My car is not working', 
               "The Nissan Armada isnt the newest large SUV, and its not very fuel efficient or steady on the highway, but its one of the more elegant and well equipped.",
               "I will never purchase another Chevy vehicle ever again. Purchased an equinox a few years back probably a month later started having engine issues within a few months had to replace the engine all together then a few months later the brand new engine started to deteriorate and break down as well. Turns out Chevy is quite aware of the terrible issue they have with equinox and their engines however they refuse to do anything about it. Then shortly after my sister bought a Chevy truck almost brand new and shortly after that she had to replace the entire transmission for the vehicle. The Chevy brand used to be a reliable brand and a trustworthy vehicle for your family. Now there are no family values they don't care and you're on your own to deal with issues the issues.",
               "This was purchased for my daughter and shes 8. She likes it and takes pics and videos. Battery last long and comes with SD card. Picture quality is good and so is video",
               "Excellent acquisition, I was pleasantly surprised, the structure and suspension are dynamic, the color is what I expected and the interiors are very comfortable, my suggestion is that the audio system be a little more innovative"]
new_test = vectorizer.transform(extra_test)
clf_dec.predict(new_test)

array(['Camera', 'Auto', 'Auto', 'Auto', 'Auto', 'Auto'], dtype='<U6')


***The decision tree classifier still fails 1 of the 6 predictions for new_test***



### SVM model with new split


In [39]:
#Support vector classifier
clf_svm = svm.SVC(kernel = 'linear')
clf_svm.fit(X_train_vectors, y_train)
clf_svm.score(X_test_vectors, y_test)

1.0

In [40]:
clf_svm.predict(new_test)

array(['Camera', 'Auto', 'Auto', 'Auto', 'Camera', 'Auto'], dtype='<U6')


***SVC with a 33% test size and new random state correctly classifies all of the new_test opinions***
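To compare the four runs side by side, the predictions printed in this notebook can be scored against the expected labels. This is only a bookkeeping sketch: the lists are copied from the outputs above, and the names `expected`, `predictions` and `summary` are introduced here for illustration.

```python
# Expected labels for the six new opinions, as stated when extra_test was defined
expected = ["Camera", "Auto", "Auto", "Auto", "Camera", "Auto"]

# Predictions copied from the notebook outputs
predictions = {
    "Decision tree, 30% test split": ["Camera", "Auto", "Auto", "Auto", "Auto", "Auto"],
    "SVC, 30% test split": ["Auto", "Auto", "Auto", "Auto", "Auto", "Auto"],
    "Decision tree, 33% test split": ["Camera", "Auto", "Auto", "Auto", "Auto", "Auto"],
    "SVC, 33% test split": ["Camera", "Auto", "Auto", "Auto", "Camera", "Auto"],
}

summary = {}
for run, predicted in predictions.items():
    # Zero-based positions where the prediction differs from the expected label
    misses = [i for i, (want, got) in enumerate(zip(expected, predicted)) if want != got]
    summary[run] = (len(expected) - len(misses), misses)
    print(f"{run}: {summary[run][0]}/{len(expected)} correct, misclassified positions {misses}")
# Decision tree, 30% test split: 5/6 correct, misclassified positions [4]
# SVC, 30% test split: 4/6 correct, misclassified positions [0, 4]
# Decision tree, 33% test split: 5/6 correct, misclassified positions [4]
# SVC, 33% test split: 6/6 correct, misclassified positions []
```

Every error is a camera opinion predicted as Auto. Since there are only six new opinions, these differences should be read as indicative rather than conclusive.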



## Conclusions: 

- SVC appears to be a more suitable model for this opinion data

- Changing the train/test split (to 33% test) and the random state gave better SVC results on the new opinions, while the decision tree predictions did not change
