## Subteam 1: Creating a dictionary out of the review text file.

Here are Subteam 1's results. In our portion of the project, we build a dictionary of category counts from the provided review text file. Combined with LIWC's dictionary, these counts will later be used to create NLP models.


Here we mount Google Drive so that we can work with the files in the project folder.

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')
import numpy as np

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


First, we need a placeholder LIWC dictionary to use until we have access to the real one.

In [None]:

#Our placeholder LIWC dictionary, to be replaced later with the real one. Changing it changes everything downstream: more categories make our feature dictionary and ndarray bigger.
LIWC = {'Self': ['I', 'we', 'me', 'my'], 'Dirty':['Disgusting', 'Gross', 'Rats', 'Run Down'],'Blah': ['No', 'crazy', 'aweful']}


After we have made this dictionary, we load the file of reviews that was shared with us.

In [None]:
#read and store the whole file to pass into our function later
review_path = "/content/gdrive/MyDrive/CSE-CSB_Capstone/Subteam-1/output_review_yelpHotelData_NRYRcleaned.txt"
with open(review_path,"r") as f:
    f_content = f.read()

#a list that will store each of the \n separated reviews from the file
reviews=[]

#open the file again and separate each review by the \n char
with open(review_path,"r",newline="\n") as file:
    #each line of the file is one review; strip the trailing whitespace and store it
    for review in file:
        reviews.append(review.rstrip())
print("Number of reviews in the file: ",len(reviews))
#del review,file


Number of reviews in the file:  5854


Then, we will create a function that will take a piece of text, and using our LIWC dictionary, compare words in the text to the categories in the LIWC dict. This function will then return a dictionary of our own that will contain each category and the number of times each category appeared in the text.

In [None]:
#This function takes in a body of text and, for each category in the LIWC dictionary, counts the number of occurrences of that category's words in the text.
#It returns a dictionary with categories as keys and the number of times each category's words occur as values.
#Note: str.count matches case-sensitive substrings, so 'we' is also counted inside 'were'.
def LIWC_feature_extraction(text):
    #create a dictionary that will hold our values
    feature_dict={}

    #for every key in the LIWC, we compare our text to the categories and 
    #then we increment whenever a word from the key/ value appears in our text
    for key in LIWC:
      num_occur = 0
      for i in LIWC[key]:
        num_occur += text.count(i)
      feature_dict[key] = num_occur

    
    return feature_dict

Here we demonstrate the LIWC_feature_extraction function

In [None]:
#Testing the LIWC_feature_extraction() functionality
#if we want all of the words in the file to be analyzed
print(LIWC_feature_extraction(f_content))

#a specific review in our file to be analyzed
print(LIWC_feature_extraction(reviews[4]))

#let's store the dictionary that results from the feature extraction in review_dict
review_dict = LIWC_feature_extraction(f_content)

{'Self': 76654, 'Dirty': 19, 'Blah': 1566}
{'Self': 5, 'Dirty': 0, 'Blah': 0}
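
Because `LIWC_feature_extraction` relies on `str.count`, it counts substrings and is case sensitive, which likely inflates categories such as `Self`. The sketch below shows one possible alternative that counts whole words only and ignores case. The names `LIWC_SKETCH` and `whole_word_feature_extraction` are hypothetical and are not part of the notebook; whether whole-word matching is the desired behavior is an assumption.

```python
import re

# Hypothetical stand-in for the placeholder LIWC dictionary above
LIWC_SKETCH = {
    "Self": ["I", "we", "me", "my"],
    "Dirty": ["Disgusting", "Gross", "Rats", "Run Down"],
}

def whole_word_feature_extraction(text, liwc=LIWC_SKETCH):
    """Count whole-word, case-insensitive matches for each category."""
    feature_dict = {}
    for category, words in liwc.items():
        total = 0
        for word in words:
            # \b anchors the match at word boundaries, so "we" does not match "were"
            pattern = r"\b" + re.escape(word) + r"\b"
            total += len(re.findall(pattern, text, flags=re.IGNORECASE))
        feature_dict[category] = total
    return feature_dict

sample = "We were home. I think my room was gross."
print(whole_word_feature_extraction(sample))  # {'Self': 3, 'Dirty': 1}
```

On the same sample, `str.count` would count "we" inside "were" and "me" inside "home", while missing the capitalized "We" and the lowercase "gross".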


After this, we create a function that converts the feature dictionary of a review into a numpy.ndarray.

In [None]:
#This function takes in a dictionary and returns a 1xn numpy array whose elements are the
#number of occurrences recorded for each key
def dict_to_nparray(dictionary):
  #collect the counts in key order, e.g. {'Self': 70484, 'Dirty': 19} becomes [70484, 19]
  vals = np.fromiter(dictionary.values(), dtype=int)

  #reshape the numpy array to be a 1x(number of keys in dictionary) matrix, e.g. [[70484, 19]]
  vals = vals.reshape(1,len(dictionary))

  #return an ndarray that holds the number of times each category appears in the text
  return vals
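
As a small worked example of this conversion, the sketch below applies the same two NumPy calls to a made-up count dictionary. The name `example_counts` and its values are hypothetical; they only illustrate the shape of the result.

```python
import numpy as np

# Hypothetical per-review counts, in the same form LIWC_feature_extraction returns
example_counts = {"Self": 5, "Dirty": 0, "Blah": 2}

# np.fromiter reads the values in the dictionary's insertion order,
# and reshape turns the flat array into a single row
row = np.fromiter(example_counts.values(), dtype=int).reshape(1, len(example_counts))

print(row)        # [[5 0 2]]
print(row.shape)  # (1, 3)
```

Keeping each review as a single row means the rows can later be stacked along axis 0, one row per review.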






Now, for each review in the text file, we compute its array of feature counts and stack these rows into one 2-D numpy array with one row per review and one column per LIWC category.

In [None]:
#for each review in the list of reviews, extract the features, convert the counts to a
#1x(number of LIWC categories) numpy array, and collect the rows in a list
review_rows = [dict_to_nparray(LIWC_feature_extraction(review)) for review in reviews]

#stack the rows into one 2-D array of shape (number of reviews, number of LIWC categories)
all_reviews_array = np.vstack(review_rows)

Here we spot-check a few reviews to make sure everything is in working order.

In [None]:

#testing
print("Len of the reviews numpy 2-D array:")
print(len(all_reviews_array))
print()
print("Review 1 feature counts: ")
print(all_reviews_array[0])
print("Comparing against feature extraction function's results: ")
print(LIWC_feature_extraction(reviews[0]))
print()
print("Review 2 feature counts: ")
print(all_reviews_array[1])
print("Comparing against feature extraction function's results: ")
print(LIWC_feature_extraction(reviews[1]))
print()
print("Review 6 feature counts: ")
print(all_reviews_array[5])
print("Comparing against feature extraction function's results: ")
print(LIWC_feature_extraction(reviews[5]))
print()
print("Review 5853 feature counts: ")
print(all_reviews_array[5852])
print("Comparing against feature extraction function's results: ")
print(LIWC_feature_extraction(reviews[5852]))

#I would like to create a way to go through each review and 

Len of the reviews numpy 2-D array:
5854

Review 1 feature counts: 
[21  0  0]
Comparing against feature extraction function's results: 
{'Self': 21, 'Dirty': 0, 'Blah': 0}

Review 2 feature counts: 
[1 0 0]
Comparing against feature extraction function's results: 
{'Self': 1, 'Dirty': 0, 'Blah': 0}

Review 6 feature counts: 
[3 1 1]
Comparing against feature extraction function's results: 
{'Self': 3, 'Dirty': 1, 'Blah': 1}

Review 5853 feature counts: 
[17  0  0]
Comparing against feature extraction function's results: 
{'Self': 17, 'Dirty': 0, 'Blah': 0}
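
The manual spot checks above could also be automated so that every row is compared, not only a handful. The sketch below illustrates the idea on toy data; `toy_liwc`, `toy_reviews`, and `toy_feature_extraction` are hypothetical stand-ins for `LIWC`, `reviews`, and `LIWC_feature_extraction`, so the block runs on its own.

```python
import numpy as np

# Toy stand-ins (hypothetical) for the LIWC dictionary and the list of reviews
toy_liwc = {"Self": ["my", "me"], "Dirty": ["Gross"]}
toy_reviews = ["my room, my rules", "Gross carpet", "nothing to report"]

def toy_feature_extraction(text):
    # Same substring-count logic as LIWC_feature_extraction
    return {key: sum(text.count(word) for word in words)
            for key, words in toy_liwc.items()}

# One row per review, one column per category
toy_array = np.array([list(toy_feature_extraction(r).values()) for r in toy_reviews])

# Compare every stored row with a fresh extraction and collect the indices that disagree
mismatches = [
    i for i, review in enumerate(toy_reviews)
    if not np.array_equal(toy_array[i], list(toy_feature_extraction(review).values()))
]
print("Rows checked:", len(toy_reviews), "Mismatches:", mismatches)
```

An empty `mismatches` list means every stored row agrees with the extraction function; any index that appears points at a review worth inspecting by hand.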


In [None]:
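#This function takes in one review's array of feature counts and returns the name of the
#LIWC category with the highest count
def key_category(nparray):
  max_val_index = np.argmax(nparray) #index of the largest count, used to look up the key
  keys_list = list(LIWC) #category names, in the same order as the array's columns
  dict_cat = keys_list[max_val_index]
  return dict_cat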


#testing this function
print("Most frequent category of review 1: ")
print(key_category(all_reviews_array[0]))
print()
print("Most frequent category of review 6: ")
print(key_category(all_reviews_array[5]))

Most frequent category of review 1: 
Self

Most frequent category of review 6: 
Self
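
One detail worth keeping in mind is that `np.argmax` returns the first index of the maximum. A review with no matches at all is therefore labeled with the first category, and ties go to the earlier key. The sketch below shows a possible variant that returns `None` for an all-zero row; `key_category_or_none` and the `categories` list are hypothetical names, and treating all-zero rows as "no category" is an assumption about the desired behavior.

```python
import numpy as np

# Same key order as the placeholder LIWC dictionary
categories = ["Self", "Dirty", "Blah"]

def key_category_or_none(counts):
    """Hypothetical variant: return None when no category word was found."""
    counts = np.asarray(counts)
    if counts.max() == 0:
        return None
    # np.argmax returns the first index of the maximum, so ties go to the earlier key
    return categories[int(np.argmax(counts))]

print(key_category_or_none([0, 0, 0]))  # None
print(key_category_or_none([0, 2, 2]))  # Dirty
print(key_category_or_none([3, 1, 1]))  # Self
```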


SVM:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing

#read in data
#LIWC_result must already be loaded (see the read_csv cell below); if it is not, this cell raises a NameError
#LIWC_result=pd.read_csv("LIWC-outputOnContentsAndNY.csv")


#create the feature df by dropping the label and the non-feature columns
df_X=LIWC_result.drop(["realOrFake","Unnamed: 0","content","unknown1","unknown2","unknown3","Segment","dates","reviewID","reviewerID","productID"],axis="columns")

#create the target df
df_Y=LIWC_result.realOrFake


#split into training and testing sets by an 8:2 ratio
Xtr,Xte,Ytr,Yte=train_test_split(df_X,df_Y,test_size=0.2)

#parameter range
#param_grid = {'C': [ 1, 10], 
#             'gamma': [1,0.1],
#             'kernel': ['rbf']} 

#search for best parameters
#grid = GridSearchCV(svm.SVC(), param_grid, refit = True, cv=15)

#preprocess the data, so that grid.fit doesn't run for several hours
#Xtr = preprocessing.scale(Xtr) 

#run this for several minutes
#grid_search=grid.fit(Xtr,Ytr)

#print(grid_search.best_estimator_)

#create the model, fit it on the training set, and print its accuracy on the test set
model=svm.SVC(C=1,gamma=1)
model.fit(Xtr,Ytr)
print(model.score(Xte,Yte))

NameError: ignored
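
Since the LIWC report could not be loaded in this run, the sketch below illustrates the intended SVM workflow on synthetic data instead. The data, the 5-fold setting, and the fixed random seeds are assumptions made for illustration only; they are not the real LIWC features. Putting `StandardScaler` in a pipeline is one possible way to apply the scaling step, so that the training and test sets are transformed consistently.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the LIWC feature table (hypothetical data, not the real report)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 8:2 train/test split, as in the cell above
Xtr, Xte, Ytr, Yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline, so it is fit on the training folds only
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [1, 10], "svc__gamma": [1, 0.1]}

# Cross-validated search over the same small grid sketched in the comments above
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(Xtr, Ytr)

print(grid.best_params_)
test_accuracy = grid.score(Xte, Yte)
print(test_accuracy)
```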

**TODO:** Create an NLP model that will take in our vector of each review's category and create a

In [None]:
#read liwc results in as df
import pandas as pd
LIWC_result=pd.read_csv("LIWC-outputOnContentsAndNY.csv");

FileNotFoundError: ignored

In [None]:

#This week HW 4/4/22:

#read LIWC documentation and reformat the data to fit the LIWC software
#load this data into LIWC and see the report, and maybe try and download this as data for our manipulation
#build a classifier on top of LIWC report
#LIWC has a commandline, we want a commandline tool to process the data and then Python can execute cmd line 
#LIWC report as an offline file and read from this file. 

#REALLY try to use cmd line
#for each review have an array of each of the above
#numpy stack (multiple vectors into a big matrix)
#np.vstack or np.stack

#start building classifier once you get all the report


#Sprint 10 HW:
# Find a classifier that can model the discrete data we now have
# Logistic regression would be a good starting point (data normalization technique: normalize the large and small values to a similar scale)
# use sklearn package for data preprocessing

#---------------------------------------------------------------------------------------------------------------------------------------------------------------------
# LIWC reads like IRIS dataset 
# read LIWC csv into a dictionary
#df2.to_csv(r'ss_review.txt', header=None, index=None)


#generate a csv file from the results of the LIWC software report (LIWC already does this)
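
The Sprint 10 notes above suggest logistic regression with normalized features as a starting point. A minimal sketch of that idea, using sklearn for the preprocessing, could look like the following. The data is synthetic and hypothetical: two features on very different scales stand in for LIWC columns, and the label is constructed purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical stand-ins for LIWC columns)
rng = np.random.default_rng(1)
small = rng.normal(size=(300, 1))
large = rng.normal(scale=1000.0, size=(300, 1))
X = np.hstack([small, large])
y = (small[:, 0] + large[:, 0] / 1000.0 > 0).astype(int)

Xtr, Xte, Ytr, Yte = train_test_split(X, y, test_size=0.2, random_state=1)

# StandardScaler brings both columns to zero mean and unit variance before the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(Xtr, Ytr)

accuracy = clf.score(Xte, Yte)
print(accuracy)
```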




