# Lab 4

In this lab, you will practice text processing and working with text files.

Use the link below to download the dataset for this lab.

https://ubcca-my.sharepoint.com/:u:/g/personal/fatemeh_fard_ubc_ca/Ea0codoUmEVGon4ZiXqdHowBuLESdO824rcEj9uAH30FYg?e=q0NfCU

The dataset includes hundreds of CSV files, each named after a mobile app. Each file has multiple columns, including ``content`` (the review written by the user), ``score`` (the score given to the app), ``thumbsUp``, ``reviewCreatedVersion`` (the app version the review is for), ``at`` (the date and time of the review), ``replyContent`` (the app developers' reply to the review), and ``repliedAt`` (the date and time the reply was written).

For this dataset, answer the following. 

1) How many files are there in the folder (in other words, how many apps)?

2) How many app reviews do we have in total? 

3) What is the date range of the app reviews, meaning the earliest review and the latest one? Include both date and time.

4) How many reviews are written in each month of 2021? 

5) What is the frequency of the reviews written in each hour in the month of August 2021? Visualize this in a bar chart.

6) What is the average score for all apps? 

7) What is the frequency of the scores for all apps? Visualize it with a bar chart.

8) What is the frequency of the thumbsUp for each reply? Visualize it.

Consider the column `content` to answer the following questions.

9) Pre-process the text by removing the following characters from the review texts: ``. , " : ; ' ) ( * ``

10) Pre-process the text by removing multiple occurrences of question marks and exclamation marks. For example, convert ``The app is super easy to work with!!!!!`` to ``The app is super easy to work with!``.

11) OPTIONAL: Apply sentiment analysis to the reviews to check whether positive reviews are the ones that receive high scores and negative reviews are the ones that receive low scores. Is there any correlation between the sentiment of the reviews and the `score` given by the user?

Hint: you can write a function that cleans one review and then apply it to the specified column using the ``apply`` function.
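
As a rough illustration of questions 9 and 10 and the hint, here is a minimal sketch of a per-review cleaning function. The names `clean_review`, `PUNCT_TO_DROP`, and `REPEATED_MARK` and the sample reviews are made up for this example; it uses a backreference to collapse repeated marks, which is only one of several ways to do it.

```python
import re

# Characters listed in question 9: . , " : ; ' ) ( *
PUNCT_TO_DROP = re.compile(r"[.,\":;'()*]")
# A run of two or more identical '?' or '!' characters (question 10)
REPEATED_MARK = re.compile(r"([?!])\1+")


def clean_review(text):
    """Remove the listed characters and collapse repeated ? or ! in one review."""
    text = str(text)  # a missing review may arrive as NaN, which is a float
    text = PUNCT_TO_DROP.sub("", text)
    return REPEATED_MARK.sub(r"\1", text)


# Made-up sample reviews, for illustration only
samples = [
    "The app is super easy to work with!!!!!",
    "Why does it crash??? (version 2.1)",
]
cleaned = [clean_review(s) for s in samples]
print(cleaned)
```

On a pandas DataFrame, the same function could be applied to every review with ``df['content'].apply(clean_review)``.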


**Dataset licensing:** 

You are not allowed to share this dataset publicly or use it for any purpose other than this lab. This is a dataset collected by the students in my lab for their graduate theses.

## If graphs don't show up in the .ipynb file, check the HTML file; they should show up there.

In [1]:
import os
import pandas as pd
from datetime import datetime
import altair as alt
import numpy as np
from IPython.display import display
import re

In [2]:
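def process_text(text):
    """Clean one review: drop the listed punctuation and collapse repeated ? and !."""
    text = str(text)
    # Question 9: remove . , " : ; ' ) ( *
    text = re.sub(r'[.,":;\'()*]', '', text)
    # Question 10: collapse runs of ? or ! into a single mark
    text = re.sub(r'\?+', '?', text)
    text = re.sub(r'!+', '!', text)

    return text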

folder_path = 'sheets'
csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]
num_csv_files = len(csv_files)

print("1. There are", num_csv_files, "files in the folder.")

review_count = 0
first_review = None  # earliest review timestamp seen so far
last_review = None   # latest review timestamp seen so far
# An explicit dtype keeps the empty Series numeric
month_2021 = pd.Series(dtype=float)
hour_1_2021 = pd.Series(dtype=float)
avg_score = []
score = pd.Series(dtype=float)
thumbsUp = pd.Series(dtype=float)
for csv in csv_files:
    file_path = os.path.join(folder_path, csv)
    
    # Check if the file is empty by reading its first character
    with open(file_path, 'r') as file:
        first_char = next((char for char in file.read(1) if not char.isspace()), None)
        
    # Skip if the file is empty (or starts with whitespace)
    if first_char is None:
        continue
    
    df = pd.read_csv(file_path)
    avg_score.append(df['score'].mean())
    review_count += len(df)

    # Parse the timestamps first so the range check compares dates, not strings
    df['at'] = pd.to_datetime(df['at'], format='%Y-%m-%d %H:%M:%S')

    earliest, latest = df['at'].min(), df['at'].max()
    if pd.notna(earliest) and (first_review is None or earliest < first_review):
        first_review = earliest
    if pd.notna(latest) and (last_review is None or latest > last_review):
        last_review = latest

    count_months = df[(df['at'].dt.year == 2021)]['at'].dt.month.value_counts().sort_index()
    month_2021 = month_2021.add(count_months, fill_value=0)

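    # The task asks for August 2021, but the reviews in this dataset end in
    # February 2021, so January 2021 (month == 1) is used instead
    count_hours = df[(df['at'].dt.year == 2021) & (df['at'].dt.month == 1)]['at'].dt.hour.value_counts().sort_index()
    hour_1_2021 = hour_1_2021.add(count_hours, fill_value=0)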

    count_score = df['score'].value_counts().sort_index()
    score = score.add(count_score, fill_value=0)

    count_thumbsUp = df['thumbsUpCount'].value_counts().sort_index()
    thumbsUp = thumbsUp.add(count_thumbsUp, fill_value=0)
    df['cleaned_content'] = df['content'].apply(process_text)

print("2. In total, there are", review_count, "reviews.")
print("3. First review was on", first_review, "Last review was on", last_review, "")
print("4. Number of reviews by month in 2021:")
print(dict(sorted(month_2021.items())))
print("5. Frequency of reviews by hour in January 2021")
month_data = pd.DataFrame(list(hour_1_2021.items()), columns=['Hour', 'Count of Reviews'])
chart = alt.Chart(month_data).mark_bar().encode(
    x='Hour',
    y='Count of Reviews'
)
display(chart)
print("6. Average score for all apps:")
# Mean of the per-app mean scores: each app counts equally, regardless of how many reviews it has
print(round(np.mean(avg_score),2))
print("7. Frequency of scores for all apps:")
score_data = pd.DataFrame(list(score.items()), columns=['Score', 'Count of Score'])
chart2 = alt.Chart(score_data).mark_bar().encode(
    x='Score',
    y='Count of Score'
)
display(chart2)
print("8. Frequency of Thumbs Up shown as a density plot:")
thumb_data = pd.DataFrame(list(thumbsUp.items()), columns=['Number of Thumbs Up', 'Count'])
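# Note: the density is estimated over the distinct thumbs-up values in thumb_data,
# not weighted by how often each value occurs
chart3 = alt.Chart(thumb_data).transform_density(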
    'Number of Thumbs Up',
    as_=['Number of Thumbs Up', 'density']
    ).mark_area().encode(
    x=alt.X('Number of Thumbs Up', title='Thumbs Up'),
    y='density:Q'
)
display(chart3)
print("9. and 10: - example of last processed text.")
print(df['cleaned_content'])

1. There are 3686 files in the folder.


2. In total, there are 2877020 reviews.
3. First review was on 2009-02-20 14:07:32 Last review was on 2021-02-10 05:32:18 
4. Number of reviews by month in 2021:
{1: 40238.0, 2: 11583.0}
5. Frequency of reviews by hour in January 2021


6. Average score for all apps:
4.0
7. Frequency of scores for all apps:


8. Frequency of Thumbs Up shown as a density plot:


9. and 10: - example of last processed text.
0      Super awesome 👍🍰 stellar concept 💋 Love it ALL...
1      Needs updated cant even get into the ranking l...
2                Love it! But needs some upgrades please
3      This game needs more of everything it already ...
4      It took 5 attempts before my pet ccame out of ...
                             ...                        
325    Great  but should make screen biggerand to get...
326    Great widget really love it But how Im increas...
327      Bit too small Also how do i get its energy up?!
328    Fun and cuteperfect for my kids to carefor! En...
329    I bought this app my phone told me it was inst...
Name: cleaned_content, Length: 330, dtype: object
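
For the optional question 11, one possible starting point is a lexicon-based sentiment score correlated with ``score``. The sketch below is only an illustration under stated assumptions: the word lists, the helper names `toy_sentiment` and `pearson`, and the sample reviews are all made up, and a real analysis would likely use an established sentiment tool on the actual `content` column.

```python
import re

# Hypothetical toy lexicon, for illustration only
POSITIVE = {"love", "great", "awesome", "fun", "easy"}
NEGATIVE = {"crash", "bad", "slow", "hate", "broken"}


def toy_sentiment(text):
    """Count of positive words minus count of negative words in one review."""
    words = re.findall(r"[a-z']+", str(text).lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up (content, score) pairs standing in for the real columns
reviews = [
    ("Love it, great and fun", 5),
    ("Easy to use", 4),
    ("It is okay", 3),
    ("Slow and it will crash", 2),
    ("Bad, broken, hate it", 1),
]
sentiments = [toy_sentiment(text) for text, _ in reviews]
scores = [s for _, s in reviews]
r = pearson(sentiments, scores)
print(sentiments, round(r, 3))
```

A coefficient near 1 would suggest that reviews with more positive wording tend to come with higher scores; on the real data the value depends heavily on the sentiment method chosen.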
