In [1]:
from preprocessing import *
import numpy as np
import pandas as pd  # imported explicitly instead of relying on the star import above
import sklearn
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv('Restaurant reviews.csv')
df.head()

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures,7514
0,Beyond Flavours,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5,"1 Review , 2 Followers",5/25/2019 15:54,0,2447.0
1,Beyond Flavours,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5,"3 Reviews , 2 Followers",5/25/2019 14:20,0,
2,Beyond Flavours,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5,"2 Reviews , 3 Followers",5/24/2019 22:54,0,
3,Beyond Flavours,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5,"1 Review , 1 Follower",5/24/2019 22:11,0,
4,Beyond Flavours,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5,"3 Reviews , 2 Followers",5/24/2019 21:37,0,


PREPROCESSING

Drop the columns 'Restaurant', 'Pictures' and '7514', which are not used in the analysis.
Convert non-numeric ratings to NaN, then drop the rows containing NaN values, since they make up only a small part of the dataset.
Replace each \n with a space in the 'Review' column to simplify the text preprocessing.

In [3]:
# remove the columns that are not needed
df = df.drop(['Restaurant', 'Pictures', '7514'], axis=1)

# inspect the raw rating values before converting them
print(df["Rating"].unique())

def convert_to_float(x):
    """Return x as a float, or NaN when it is not numeric (e.g. 'Like')."""
    try:
        return float(x)
    except (ValueError, TypeError):
        return np.nan

df["Rating"] = df["Rating"].apply(convert_to_float)
# drop the rows with missing values (only a few rows are affected)
df = df.dropna()
# replace the newlines in the review text with spaces
df['Review'] = df['Review'].str.replace('\n', ' ', regex=False)
# encode the reviewer names as integer labels
label_encoder = LabelEncoder()
df['Reviewer_encode'] = label_encoder.fit_transform(df['Reviewer'])

print(df.shape)
df.head()

['5' '4' '1' '3' '2' '3.5' '4.5' '2.5' '1.5' 'Like' nan]
(9954, 6)


Unnamed: 0,Reviewer,Review,Rating,Metadata,Time,Reviewer_encode
0,Rusha Chakraborty,"The ambience was good, food was quite good . h...",5.0,"1 Review , 2 Followers",5/25/2019 15:54,4973
1,Anusha Tirumalaneedi,Ambience is too good for a pleasant evening. S...,5.0,"3 Reviews , 2 Followers",5/25/2019 14:20,765
2,Ashok Shekhawat,A must try.. great food great ambience. Thnx f...,5.0,"2 Reviews , 3 Followers",5/24/2019 22:54,954
3,Swapnil Sarkar,Soumen das and Arun was a great guy. Only beca...,5.0,"1 Review , 1 Follower",5/24/2019 22:11,6591
4,Dileep,Food is good.we ordered Kodi drumsticks and ba...,5.0,"3 Reviews , 2 Followers",5/24/2019 21:37,1616


Explanation of the instructions file:

{
"Metadata":{
    "EXTRACT_REGEX_PATTERN": {"regex_pattern": "\\d+ Review", "secondary_regex_pattern": "\\d+","new_column_name": ["nb_review"], "result_type": "int"},
    "EXTRACT_REGEX_PATTERN_2": {"regex_pattern": "\\d+ Follower", "secondary_regex_pattern": "\\d+","new_column_name": ["nb_follower"], "result_type": "int"}
},
"Time": {
    "EXTRACT_REGEX_PATTERN": {"regex_pattern": "/(\\d{4})\\s", "new_column_name": ["Year"], "result_type": "int"},
    "EXTRACT_REGEX_PATTERN_2": {"regex_pattern": "(\\d+)", "new_column_name": ["Month"], "result_type": "int"}
},
"Review": {
    "LOWERCASE": {},
    "REMOVE_URLS": {},
    "REMOVE_EMOJI": {},
    "REMOVE_EMOTICONS": {},
    "REMOVE_PUNCT": {},
    "CHAT_WORDS_CONVERSION": {},
    "REMOVE_STOPWORDS": {},
    "LEMMATIZE_ENGLISH": {}
}
}

Metadata:
    This column contains the reviewer's number of reviews and number of followers. We extract both numbers with regex patterns and store them in two new columns, nb_review and nb_follower. (See the data analysis file for why these two values are extracted.)
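
    The two-stage extraction can be sketched as follows. This is only an illustration of how a primary pattern followed by a secondary pattern might behave; extract_int is a hypothetical helper and not necessarily how the preprocessing module implements EXTRACT_REGEX_PATTERN.

```python
import re

def extract_int(text, regex_pattern, secondary_regex_pattern=None):
    """Hypothetical helper: return the first match as an int, or None."""
    match = re.search(regex_pattern, text)
    if match is None:
        return None
    # Use the first capture group if the pattern defines one, else the whole match.
    found = match.group(1) if match.groups() else match.group(0)
    if secondary_regex_pattern is not None:
        # Narrow the first match down, e.g. "3 Review" -> "3".
        inner = re.search(secondary_regex_pattern, found)
        if inner is None:
            return None
        found = inner.group(0)
    return int(found)

metadata = "3 Reviews , 2 Followers"
nb_review = extract_int(metadata, r"\d+ Review", r"\d+")
nb_follower = extract_int(metadata, r"\d+ Follower", r"\d+")
print(nb_review, nb_follower)
```

    In this sketch, a Metadata value with no follower count yields None for nb_follower instead of raising an error.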

Time:
    This column contains the date and time of the review. We extract the year and the month with regex patterns and store them in two new columns, Year and Month. (See the data analysis file for why these two values are extracted.)
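
    As a quick check of the two Time patterns, the sketch below applies them to one timestamp from the dataset using the standard re module. The Month pattern simply takes the first run of digits, so it relies on the month-first date format; the datetime parse is added here only as a cross-check and is not part of the instructions file.

```python
import re
from datetime import datetime

timestamp = "5/25/2019 15:54"

# Year: four digits preceded by "/" and followed by whitespace.
year = int(re.search(r"/(\d{4})\s", timestamp).group(1))
# Month: the first run of digits, valid only for month-first dates.
month = int(re.search(r"(\d+)", timestamp).group(1))

# Cross-check against an explicit parse (assumes month/day/year hour:minute).
parsed = datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
print(year, month, parsed.year, parsed.month)
```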

Review:
    This column contains the text of the review. We apply several preprocessing steps to clean it.
    We first lowercase the text and remove what is not useful for the analysis: URLs, emojis, emoticons and punctuation. We then convert chat words to their standard form and remove the stopwords. Finally, we lemmatize the text to reduce each word to its base form.

The order of the preprocessing steps matters, both to keep the pipeline simple and to avoid errors. For example, it is easier to lemmatize the text once the stopwords have been removed. The stopwords must also be removed after the chat words conversion, because the expanded chat words can themselves contain stopwords.
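
The effect of the ordering can be illustrated with a toy example. The chat-word dictionary and the stopword set below are small hypothetical stand-ins, not the ones used by the preprocessing module.

```python
# Hypothetical, deliberately tiny lookup tables for illustration only.
CHAT_WORDS = {"imo": "in my opinion"}
STOPWORDS = {"in", "my", "the", "was"}

def convert_chat_words(tokens):
    """Expand each chat word into its (possibly multi-word) standard form."""
    expanded = []
    for token in tokens:
        expanded.extend(CHAT_WORDS.get(token, token).split())
    return expanded

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]

tokens = "imo the food was great".split()

# Conversion first: stopwords introduced by the expansion are removed.
conversion_first = remove_stopwords(convert_chat_words(tokens))
# Stopwords first: "in" and "my" from the expansion survive.
stopwords_first = convert_chat_words(remove_stopwords(tokens))

print(conversion_first)
print(stopwords_first)
```

With the conversion applied first, only the content words remain; with the reverse order, the stopwords produced by expanding "imo" stay in the text.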

Lemmatization was preferred over stemming because it better preserves the meaning of the words, which matters for this analysis.
Rare and frequent words were not removed, because they turned out to be necessary for the analysis.

In [4]:
new_df = preprocessing("instructions.json", df)
new_df.drop(['Time', 'Metadata', 'Reviewer'], axis=1, inplace=True)
new_df.head()


Instruction EXTRACT_REGEX_PATTERN in progress...
Instruction EXTRACT_REGEX_PATTERN_2 in progress...
All instructions have been applied to the column Metadata.
Instruction EXTRACT_REGEX_PATTERN in progress...
Instruction EXTRACT_REGEX_PATTERN_2 in progress...
All instructions have been applied to the column Time.
Instruction LOWERCASE in progress...
Instruction REMOVE_URLS in progress...
Instruction REMOVE_EMOJI in progress...
Instruction REMOVE_EMOTICONS in progress...
Instruction REMOVE_PUNCT in progress...
Instruction CHAT_WORDS_CONVERSION in progress...
Instruction REMOVE_STOPWORDS in progress...
Instruction LEMMATIZE_ENGLISH in progress...
All instructions have been applied to the column Review.
All instructions have been applied.


Unnamed: 0,Review,Rating,Reviewer_encode,nb_review,nb_follower,Year,Month
0,ambience good food quite good saturday lunch c...,5.0,4973,1,2,2019,5
1,ambience good pleasant even service prompt foo...,5.0,765,3,2,2019,5
2,must try great food great ambience thnx servic...,5.0,954,2,3,2019,5
3,soumen das arun great guy behavior sincerety g...,5.0,6591,1,1,2019,5
4,food goodwe order kodi drumstick basket mutton...,5.0,1616,3,2,2019,5


In [5]:
new_df.to_csv('preprocessed_data.csv', index=False)