# **WARNING** 

### We are dealing with raw web data. Some of the retrieved information might contain explicit content (words, phrases, or references).

# Data Engineering - NLP



## Exercise 1: NLP Tweets

For this exercise, use `TfidfVectorizer` and any TWO classification models of your choice to classify the sentiment of each review in the `Restaurant_Reviews.tsv` file as Positive or Negative.

### Remember:
    1. Split your data into Train and Test sets
    2. Evaluate your model using the metrics of your choice (include a brief interpretation)
    3. Explain which model performed better and why (comparison of results)

In [58]:
#Exercise 1
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Load the data
data = pd.read_csv('../data/Restaurant_Reviews.tsv',delimiter='\t')

# Separate the features and labels
reviews = data['Review']
sentiments = data['Liked']

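# Split the raw text into train and test sets first, so the vectorizer never sees the test reviews
reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, sentiments, test_size=0.3, random_state=42)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights on the training reviews only, then reuse them for the test reviews
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)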

# Initialize and train Logistic Regression model
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

# Make predictions on the test set
logreg_predictions = logreg_model.predict(X_test)

# Evaluate the model
logreg_accuracy = accuracy_score(y_test, logreg_predictions)
logreg_report = classification_report(y_test, logreg_predictions)

# Initialize and train Random Forest Classifier model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_report = classification_report(y_test, rf_predictions)




"""
The classification report gives precision, recall, and F1-score for the negative (0) and positive (1) sentiment classes. The Random Forest Classifier reaches an accuracy of about 0.76, meaning it predicts the sentiment of roughly 76% of the 300 test reviews correctly. Recall is higher for negative reviews (0.84) than for positive ones (0.68), so the model misses more positive reviews than negative ones, while precision is higher for the positive class (0.81 vs. 0.73).
Because the forest is created without a fixed random_state, these numbers can shift slightly from run to run.



"""

print(rf_report)
print(rf_accuracy)



              precision    recall  f1-score   support

           0       0.73      0.84      0.78       152
           1       0.81      0.68      0.74       148

    accuracy                           0.76       300
   macro avg       0.77      0.76      0.76       300
weighted avg       0.77      0.76      0.76       300

0.7633333333333333
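
The exercise asks for a comparison of two models, but only the Random Forest report is printed above. The following is a minimal sketch of how both models could be scored side by side with a `Pipeline`, so the vectorizer is always fit on training text only. The `toy_reviews` and `toy_labels` lists are made-up stand-ins for the `Review` and `Liked` columns, and the `max_iter`, `n_estimators`, and `random_state` values are illustrative assumptions, so the scores it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 'Review' and 'Liked' columns.
toy_reviews = [
    "loved the food", "great service and tasty pasta", "amazing place",
    "would come back again", "delicious and friendly staff", "best burger ever",
    "terrible service", "food was cold and bland", "never coming back",
    "worst meal ever", "rude staff and slow kitchen", "overpriced and disappointing",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text; each pipeline then fits its own vectorizer on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.33, random_state=42, stratify=toy_labels
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit and score every model on the same split so the numbers are comparable.
comparison = {}
for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    comparison[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions, zero_division=0),
    }

for name, scores in comparison.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, f1={scores['f1']:.2f}")
```

Swapping the toy lists for `data['Review']` and `data['Liked']` would give one row of metrics per model on the same test split, which is the basis for the "which model performed better" discussion.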



## Exercise 2: App Review NLP work (Similar to Web Data workshop)

The Apple app store has a `GET` API to get reviews on apps. The URL is:

```
https://itunes.apple.com/{COUNTRY_CODE}/rss/customerreviews/id={APP_ID_HERE}/page={PAGE_NUMBER}/sortby=mostrecent/json
```

Note that you need to provide:

- The country codes (`'us'`, `'gb'`, `'ca'`, `'au'`) - use all four
- The app ID. This can be found in the web page for the app right after `id`.
    - You will need to find the IDs for these apps - Candy Crush, Facebook, Twitter & Tinder
- The "Page Number". The request responds with multiple pages of data, but sends them one at a time, so you can cycle through the data pages for any app in any country. (Be careful: there are limits to the number of pages you can access.)

For example, Candy Crush's US webpage is `https://apps.apple.com/us/app/candy-crush-saga/id553834731`, which means that the ID is `553834731`.
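
As a small illustration of the two steps described above (reading the ID off the app's web page URL and filling in the review URL template), here is a sketch using only the standard library. The helper names `extract_app_id` and `build_review_url` are hypothetical and introduced just for this example; the URL template and the Candy Crush URL are the ones given in the text.

```python
import re

# Review feed template from the exercise text.
RSS_TEMPLATE = (
    "https://itunes.apple.com/{country}/rss/customerreviews/"
    "id={app_id}/page={page}/sortby=mostrecent/json"
)


def extract_app_id(app_page_url):
    """Return the digits that follow 'id' in an App Store page URL."""
    match = re.search(r"/id(\d+)", app_page_url)
    if match is None:
        raise ValueError(f"No app ID found in {app_page_url!r}")
    return match.group(1)


def build_review_url(country, app_id, page):
    """Fill the review feed template for one country, app, and page."""
    return RSS_TEMPLATE.format(country=country, app_id=app_id, page=page)


candy_crush_id = extract_app_id(
    "https://apps.apple.com/us/app/candy-crush-saga/id553834731"
)

# One first-page URL per country code listed in the exercise.
review_urls = [
    build_review_url(country, candy_crush_id, 1)
    for country in ("us", "gb", "ca", "au")
]

for url in review_urls:
    print(url)
```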


Do the following:

1. Using any method you want (pre-trained models, dimensionality reduction, feature engineering, etc.), build the best **regression** model you can to predict the 5-star rating. Evaluate it in regression terms (mean squared error) and in classification terms (accuracy score, etc.).
2. Do the same as in step 1, but use a classification model. Are classification models better or worse at predicting a 5-point rating scale? Explain in a few paragraphs and justify with metrics.

P.S. Feel free to do as much data engineering as you like to boost your model (e.g., binary vs. multinomial targets).
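
Scoring a regression model "in classification terms" requires turning its continuous predictions into star ratings first. One possible approach, sketched below, is to round each prediction to the nearest whole star and clip it to the 1 to 5 range before computing accuracy. The rounding rule, the helper names, and the numbers in `true_ratings` and `raw_predictions` are all assumptions made up for illustration, not results from any model.

```python
import math


def to_star_rating(prediction):
    """Round half-up to the nearest whole star and clip to the 1-5 scale."""
    rounded = math.floor(prediction + 0.5)
    return int(min(5, max(1, rounded)))


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
true_ratings = [5, 1, 3, 4, 2]
raw_predictions = [4.6, 0.2, 3.4, 5.7, 3.1]

# Regression view: error on the raw continuous output.
regression_mse = mean_squared_error(true_ratings, raw_predictions)

# Classification view: exact-match accuracy after mapping to whole stars.
star_predictions = [to_star_rating(p) for p in raw_predictions]
star_accuracy = accuracy(true_ratings, star_predictions)

print("Star predictions:", star_predictions)
print("MSE:", regression_mse)
print("Accuracy:", star_accuracy)
```

The same two numbers can be computed for a classifier's predictions, which puts both model types on a common footing for the comparison asked for in step 2.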


In [3]:
# exercise 2

import requests
import json
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

# Function to get app reviews
def get_app_reviews(country, app_id, page_number):
    url = f"https://itunes.apple.com/{country}/rss/customerreviews/id={app_id}/page={page_number}/sortby=mostrecent/json"
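    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        data = response.json()
        # 'entry' may be missing (e.g. on a page with no reviews); fall back to an empty list
        return data.get('feed', {}).get('entry', [])
    else:
        print(f"Request failed with status code: {response.status_code}")
        return []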

# Function to preprocess reviews
def preprocess_reviews(reviews):
    processed_reviews = []
    for review in reviews:
        content = review['content']['label']
        rating = int(review['im:rating']['label'])
        processed_reviews.append((content, rating))
    return processed_reviews


def analyze_reviews(country_codes, app_ids, page_limit):
    regression_results = []
    classification_results = []

    for country in country_codes:
        for app_id in app_ids:
            ratings = []
            reviews = []

            for page_number in range(1, page_limit + 1):
                app_reviews = get_app_reviews(country, app_id, page_number)
                if not app_reviews:
                    break

                processed_reviews = preprocess_reviews(app_reviews)

                for review in processed_reviews:
                    reviews.append(review[0])
                    ratings.append(review[1])

            # Perform regression and classification tasks
            if len(reviews) > 0:
                # Split data into train and test sets
                X_train, X_test, y_train, y_test = train_test_split(reviews, ratings, test_size=0.2, random_state=42)

                # Vectorize text data
                vectorizer = TfidfVectorizer()
                X_train_vectorized = vectorizer.fit_transform(X_train)
                X_test_vectorized = vectorizer.transform(X_test)

                # Regression Model
                regression_model = LinearRegression()
                regression_model.fit(X_train_vectorized, y_train)
                regression_predictions = regression_model.predict(X_test_vectorized)
                regression_mse = mean_squared_error(y_test, regression_predictions)
                regression_results.append(regression_mse)

               

    

                # Classification Model
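                # With the default lbfgs solver, multiclass targets are fit as a multinomial model;
                # the explicit multi_class argument is deprecated in recent scikit-learn releases
                classification_model = LogisticRegression()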
                classification_model.fit(X_train_vectorized, y_train)
                classification_predictions = classification_model.predict(X_test_vectorized)
                classification_accuracy = accuracy_score(y_test, classification_predictions)
                classification_results.append(classification_accuracy)

    return regression_results, classification_results


# Set parameters
country_codes = ['us', 'gb', 'ca', 'au']
app_ids = ['553834731', '284882215', '333903271', '547702041']
page_limit = 5

# Perform analysis
regression_results, classification_results = analyze_reviews(country_codes, app_ids, page_limit)

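# Sum the per-model MSEs (one regression model is trained per country/app pair).
# Note: this is a total, not an average; divide by len(regression_results) to get the mean MSE per model.
mean_squared_error_regression = sum(regression_results)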



print("Mean Squared Error:",mean_squared_error_regression )


Mean Squared Error: 30.76496759709453
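
The parsing above relies on every entry carrying both a `content` label and an `im:rating` label. Below is a sketch of a more defensive parser for the same `feed` → `entry` structure. The function name `parse_review_feed` is hypothetical, the sample payloads are made up, and the handling of a single entry arriving as a dict instead of a list is an assumption about the feed, not documented behaviour.

```python
def parse_review_feed(payload):
    """Return (review_text, rating) pairs from one decoded JSON page."""
    entries = payload.get("feed", {}).get("entry", [])
    # Assumption: a page holding a single review may expose a dict, not a list.
    if isinstance(entries, dict):
        entries = [entries]

    parsed = []
    for entry in entries:
        try:
            text = entry["content"]["label"]
            rating = int(entry["im:rating"]["label"])
        except (KeyError, TypeError, ValueError):
            # Skip entries that lack review fields or carry a malformed rating.
            continue
        parsed.append((text, rating))
    return parsed


# Made-up payloads that mimic the structure used by the code above.
sample_page = {
    "feed": {
        "entry": [
            {"content": {"label": "Fun and simple."}, "im:rating": {"label": "5"}},
            {"title": {"label": "entry without review fields"}},
            {"content": {"label": "Please revert."}, "im:rating": {"label": "2"}},
        ]
    }
}
empty_page = {"feed": {}}

parsed_sample = parse_review_feed(sample_page)
parsed_empty = parse_review_feed(empty_page)

print(parsed_sample)
print(parsed_empty)
```

Returning an empty list for a page without entries lets the paging loop stop cleanly instead of raising a `KeyError`.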


In [8]:


import requests

country_codes = ['us', 'gb', 'ca', 'au']
app_ids = ['553834731', '284882215', '333903271', '547702041']
page_number = 5

reviews = []
ratings_list = []

for country in country_codes:
    for app_id in app_ids:
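        url = f"https://itunes.apple.com/{country}/rss/customerreviews/id={app_id}/page={page_number}/sortby=mostrecent/json"
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            # Skip country/app pairs whose request failed
            continue
        data = response.json()
        # 'entry' may be missing when the requested page has no reviews
        entries = data.get('feed', {}).get('entry', [])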
        for entry in entries:
            review = entry['content']['label']
            rating = float(entry['im:rating']['label'])
            reviews.append(review)
            ratings_list.append(rating)
            
# Print the outcome
print("Reviews:")
for review in reviews:
    print(review)

print("\nRatings:")
for rating in ratings_list:
    print(rating)



Reviews:
Please stop with the Candy Royals and portrait view. Games increasingly frustrating. I’m out.
I used to love this game, and would play while I was multi-tasking and watching TV to unwind at the end of a long day. But the latest update has taken away my ability as a the user to play in landscape mode.  The developers claim it’s to keep the content fresh, but to me, it just takes away my ability to personalize the gaming experience.  

Time to find a new game. :(
I keep losing my streak and I play every day. It’s frustrating.
When you are at the end of a long day…there’s always CRUSH.
Y’all messed it up; I can’t play on my iPad any longer. It won’t go to landscape.
This game (s) are fun and simple which I like when I am passing time and waiting to either do something or go somewhere.
.Love the game; hate the new format.  Please revert.
I feel like those 5 star reviews are all bought reviews because there’s a whole lot wrong with this game. The levels feel impossible to beat with