### Principal Component Analysis

## Reviews Preparation for Natural Language Processing

Add review_scores_rating from the listings data to the reviews data. The listings data only holds the review score tied to the most recent review of each listing, so most reviews will have no matching score after the merge. Those rows are removed during cleaning.

In [None]:
#Import libraries
import numpy as np
import pandas as pd

#Set path to get cleaned listings data
path = r'Data\02_Intermediate\listings_cleaned.csv'

#Parse dates
parse_dates = ['last_review']

#Read in Airbnb cleaned_listings Data
listings = pd.read_csv(path, index_col=0, parse_dates=parse_dates, low_memory=False, sep='\t')

In [None]:
#Check listings
listings.head()

**Merge review_scores_rating from listings to corresponding reviews**

In [None]:
#Merge (assumes the reviews DataFrame has already been loaded)
review_scores = reviews.merge(listings[['last_review', 'id', 'review_scores_rating']], how='left',
                              left_on=['listing_id', 'date'], right_on=['id', 'last_review'],
                              suffixes=('_review', '_listings'))
#Check
review_scores.head()
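
To make the effect of the left merge concrete, here is a minimal sketch on toy data. The frames `toy_reviews` and `toy_listings` and all of their values are invented for illustration; only the column names mirror the ones used above. Reviews whose date does not match a listing's `last_review` come out of the merge with a missing `review_scores_rating`.

```python
import pandas as pd

# Toy data (hypothetical values); column names mirror the notebook's
toy_reviews = pd.DataFrame({
    'id': [1, 2, 3],
    'listing_id': [10, 10, 20],
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01']),
    'comments': ['Great stay', 'Very clean', 'Too noisy'],
})
toy_listings = pd.DataFrame({
    'id': [10, 20],
    'last_review': pd.to_datetime(['2019-02-01', '2019-03-01']),
    'review_scores_rating': [95.0, 60.0],
})

# Left merge keeps every review; only the most recent review per listing finds a score
toy_merged = toy_reviews.merge(
    toy_listings[['last_review', 'id', 'review_scores_rating']],
    how='left',
    left_on=['listing_id', 'date'],
    right_on=['id', 'last_review'],
    suffixes=('_review', '_listings'),
)

print(toy_merged[['listing_id', 'date', 'review_scores_rating']])
print('Reviews without a score:', toy_merged['review_scores_rating'].isna().sum())
```

Because both frames carry an `id` column that is not a shared key, the suffixes produce `id_review` and `id_listings`, which is why `id_listings` can be dropped afterwards.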

### Cleaning Merged Data Set for NLP

In [8]:
#View review_scores shape
print('review_scores original data shape:', review_scores.shape)

#View missing values in review_scores
print('Missing values: \n', review_scores.isna().sum())

In [None]:
#Drop unnecessary columns from review_scores
review_scores.drop(columns=['last_review', 'id_listings'], inplace=True)

#Rename columns
review_scores.rename(columns={'review_scores_rating':'review_rating'}, inplace=True)

#Drop duplicate values
review_scores.drop_duplicates(inplace=True)

#Strip leading and trailing white space
review_scores.comments = review_scores.comments.str.strip()

#View updated reviews shape and missing values
print('Updated reviews data shape:',review_scores.shape)
print('Missing values: \n', review_scores.isna().sum())

In [None]:
#Replace non-English (non-alphanumeric) characters in the comments with spaces
review_scores['comments'] = review_scores['comments'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)

#Remove punctuation from comments
review_scores['comments'] = review_scores['comments'].str.replace(r'[^\w\s]+', '', regex=True)

#Strip whitespace and replace empty comments with nan
review_scores['comments'] = review_scores['comments'].str.strip().replace('', np.nan)

#Remove rows with missing comments and/or review_rating
review_scores.dropna(subset=['comments', 'review_rating'], inplace=True)

#View updated reviews shape
print('Updated reviews data shape:',review_scores.shape)
print('Missing values: \n', review_scores.isna().sum())
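
The cleaning steps can be checked on a small example. This is a sketch under assumptions: the `toy` frame and its values are made up, and the whitespace-collapsing step is an extra added here for illustration, not part of the original cell.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'comments': ['Great place!!!', '   ', 'Très bien :)', '...'],
    'review_rating': [98.0, 90.0, np.nan, 80.0],
})

# Replace every character that is not an ASCII letter or digit with a space
toy['comments'] = toy['comments'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)

# Collapse runs of whitespace and trim the ends (illustrative extra step)
toy['comments'] = toy['comments'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Treat empty strings as missing, then drop rows missing a comment or a rating
toy['comments'] = toy['comments'].replace('', np.nan)
toy = toy.dropna(subset=['comments', 'review_rating'])

print(toy)
```

Passing `regex=True` explicitly matters because the default for `Series.str.replace` changed to literal matching in pandas 2.0.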

In [None]:
#Keep only rows where comments are longer than 2 characters
review_scores = review_scores[review_scores['comments'].str.len() > 2]


In [None]:
#View updated reviews shape
print('Updated reviews data shape:',review_scores.shape)

#View review_scores
display(review_scores)

Text Analysis

Some questions to explore:
- What are the topics of negative reviews vs positive reviews?
- Do longer reviews correlate with more negative experiences?
- How would you advise first-time hosts to increase the likelihood of a positive review?
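
One possible way to approach the review-length question is to count the words in each comment and correlate that count with the rating. The sketch below uses invented toy data and a hypothetical `word_count` column; it shows the mechanics only and says nothing about the real data.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'comments': [
        'Great',
        'Nice and clean',
        'The room was dirty and the host never answered our messages',
    ],
    'review_rating': [100.0, 95.0, 40.0],
})

# Number of whitespace-separated words per comment
toy['word_count'] = toy['comments'].str.split().str.len()

# Pearson correlation between review length and rating
corr = toy['word_count'].corr(toy['review_rating'])
print(toy[['word_count', 'review_rating']])
print('Correlation:', corr)
```

A negative value here would suggest that longer reviews go with lower ratings, though correlation on the full data would still need a significance check.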


Split data into training and test sets

In [None]:
# #Convert comments and review_rating into arrays
# X = review_scores['comments'].values #training data
# y = review_scores['review_rating'].values #target data

# #Check
# print(X.shape)
# print(y.shape)

In [None]:
# #Import train_test_split
# from sklearn.model_selection import train_test_split

# #Split data
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# #Check 
# print(X_train.shape, y_train.shape)
# print(X_test.shape, y_test.shape)
# print(X_train)
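
As a quick illustration of what the split produces, here is a sketch on toy arrays (`X_toy` and `y_toy` are invented stand-ins for the comments and ratings). With no `test_size` given, `train_test_split` holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for comments (X) and ratings (y); values are hypothetical
X_toy = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
y_toy = np.arange(8, dtype=float)

# Default split: 75% train, 25% test; random_state makes it reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, y_tr.shape)
print(X_te.shape, y_te.shape)
```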

Tokenize comments

In [None]:
# #Plot histogram of review scores
# review_scores.review_rating.hist(bins=50)

# print(review_scores.shape)

In [None]:
# #Summon CountVectorizer
# from sklearn.feature_extraction.text import CountVectorizer

# #Instantiate CountVectorizer
# vect = CountVectorizer()

# #fit to training data
# vect = vect.fit(X_train)

# #Transform
# X_train_transformed = vect.transform(X_train)

In [None]:
# #View representation of X_train_transformed
# print('X_train_transformed representation: {}'.format(repr(X_train_transformed)))

# #Capture information about features
# feature_names = vect.get_feature_names_out()
# print('Number of features: {:,} '.format(len(feature_names)))
# print('\nEvery 200th feature: {}'.format(feature_names[::200]))

Naive implementation of SVR using TF-IDF features reduced with truncated SVD

In [None]:
# #Normalize word count matrix
# from sklearn.feature_extraction.text import TfidfVectorizer 

# #reduce the dimensionality to retain the first N components which capture the major variance
# from sklearn.decomposition import TruncatedSVD 

# from sklearn.svm import SVR

# #Summon Pipeline
# from sklearn.pipeline import Pipeline

# #Instantiate pipeline
# pipeline = Pipeline(steps=[('tfidf', TfidfVectorizer()), 
#                            ('svd', TruncatedSVD(random_state=42)), 
#                            ('clf', SVR())])
# #Check
# print(pipeline)
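
A minimal sketch of the same pipeline shape, fitted end to end on toy data. The `texts` and `ratings` values are invented, and `n_components=2` is chosen only because the toy vocabulary is tiny; it is not a recommendation for the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Toy reviews and ratings (hypothetical values)
texts = [
    'great clean place',
    'lovely host and clean room',
    'dirty and noisy room',
    'terrible dirty place',
    'great lovely stay',
    'noisy terrible stay',
]
ratings = [95.0, 98.0, 45.0, 30.0, 97.0, 35.0]

# TF-IDF weighting -> dimensionality reduction -> support vector regression
toy_pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2, random_state=42)),
    ('clf', SVR()),
])

# Raw strings go in; the pipeline handles vectorizing and reducing
toy_pipeline.fit(texts, ratings)
preds = toy_pipeline.predict(['clean lovely room'])
print(preds)
```

Wrapping the steps in a `Pipeline` keeps the vectorizer and SVD fitted on training data only, which avoids leaking test information during cross-validation.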

Pipeline Optimization

In [None]:
# #Summon RandomizedSearchCV
# from sklearn.model_selection import RandomizedSearchCV

# #Set Param grid for RandomizedSearchCV to explore
# param_grid= {'tfidf__max_df':(.5,.75, 1.0),
#              'svd__n_components': (50, 100, 150, 200),
#              'clf__C':(.1,1,10)}

# #Instantiate model
# random_search = RandomizedSearchCV(estimator=pipeline,param_distributions=param_grid,
#                             verbose=10, n_jobs=-1, scoring = 'r2')
# #Score
# random_search.fit(X_train, y_train)

In [None]:
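# #View best cross-validated score and parameters
# print("Best score: {:.3f}".format(random_search.best_score_))
# print("Best parameters set:")
# best_parameters = random_search.best_estimator_.get_params()
# for param_name in sorted(param_grid.keys()):
#     print("\t{}: {!r}".format(param_name, best_parameters[param_name]))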


In [None]:
#test using regression

# Machine Learning

Is there value in capturing the number of amenities per listing? See below.

In [None]:
# #Split strings on commas into features
# df.amenities=df.amenities.str.split(pat=',', expand = False)

# #Create amenities count and assign to df
# df['amenities_count'] = df['amenities'].str.len()
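
A small sketch of the amenities count on toy data. The `toy_listings` frame is invented, and it assumes amenities are stored as a plain comma-separated string; if the real column wraps values in braces or quotes, those would need to be stripped first.

```python
import pandas as pd

# Toy listings (hypothetical values), amenities as comma-separated strings
toy_listings = pd.DataFrame({
    'amenities': ['Wifi,Kitchen,Heating', 'Wifi', 'TV,Washer'],
})

# Split each string into a list of amenities
toy_listings['amenities'] = toy_listings['amenities'].str.split(',')

# .str.len() on a column of lists returns the number of items in each list
toy_listings['amenities_count'] = toy_listings['amenities'].str.len()

print(toy_listings)
```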