# Data Exploration
## Data provenance and characteristics
 
### When and from where it was collected
The data was collected from the GitHub repository provided on Moodle: https://github.com/Jl-wei/APIA2022-French-user-reviews-classification-dataset

### Text genre(s) and language(s) it covers
The data consists of French user reviews of three applications on Google Play: Garmin Connect, Huawei Health and Samsung Health.

### How it has been annotated
Each entry in the dataset is annotated with four labels: "Rating", "User Experience", "Feature Request" and "Bug Report".
It also includes the score given by the user and the written review.
The table below shows the data organized per app and type of review. A review can carry more than one label, so the label counts in a row can add up to more than the total.

| App            | Total | Rating | Bug report | Feature request | User experience |
| -------------- | ----- | ------ | ---------- | --------------- | --------------- |
| Garmin Connect | 2000  | 1260   | 757        | 170             | 493             |
| Huawei Health  | 2000  | 1068   | 819        | 384             | 289             |
| Samsung Health | 2000  | 1324   | 491        | 486             | 349             |



### Importing the data

In [None]:
import pandas as pd

# Read the data from the file
data_garmin_df = pd.read_csv('C:/Users/afons/Documents/2022-2023/2nd_semester/PLN/NLP/data/Garmin_Connect.csv')
data_samsung_df = pd.read_csv('C:/Users/afons/Documents/2022-2023/2nd_semester/PLN/NLP/data/Samsung_Health.csv')
data_huawei_df = pd.read_csv('C:/Users/afons/Documents/2022-2023/2nd_semester/PLN/NLP/data/Huawei_Health.csv')

data = pd.concat([data_garmin_df, data_samsung_df, data_huawei_df], ignore_index=True)

print(data.head())
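
The per-app counts in the table can be cross-checked once the three files are loaded. The following is a minimal sketch under stated assumptions: the small frames are hypothetical stand-ins for the CSV files, and the `app` column is an assumed name that is added before concatenation (it is not part of the original files as far as this notebook shows).

```python
import pandas as pd

label_cols = ["rating", "bug_report", "feature_request", "user_experience"]

# Toy stand-ins for two of the CSV files (hypothetical rows, not real data)
toy_garmin = pd.DataFrame({"rating": [1, 0, 1], "bug_report": [0, 1, 1],
                           "feature_request": [0, 0, 0], "user_experience": [0, 1, 0]})
toy_huawei = pd.DataFrame({"rating": [1, 1], "bug_report": [0, 1],
                           "feature_request": [1, 0], "user_experience": [0, 0]})

# Tag each frame with its app name before concatenating ("app" is an assumed column name)
frames = {"Garmin Connect": toy_garmin, "Huawei Health": toy_huawei}
toy_all = pd.concat([df.assign(app=name) for name, df in frames.items()],
                    ignore_index=True)

# Number of reviews per app and number of positive examples per label
per_app = toy_all.groupby("app")[label_cols].sum()
per_app.insert(0, "total", toy_all.groupby("app").size())
print(per_app)
```

With the real data, replacing the toy frames by `data_garmin_df`, `data_samsung_df` and `data_huawei_df` should reproduce the rows of the table.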

### Amount of examples per class
Every class has a reasonable number of examples, which will help us in the classification stage. However, some classes are much larger than others: rating has far more examples than feature request. We will have to keep this in mind and use stratification to guarantee a representative amount of data from each class in the training and testing sets.

In [None]:
import matplotlib.pyplot as plt

# print amount of examples per class in the dataset
print("Number of examples: ", len(data))
print("Rating: ", len(data[data['rating'] == 1]))
print("User Experience: ", len(data[data['user_experience'] == 1]))
print("Bug Report: ", len(data[data['bug_report'] == 1]))
print("Feature Request: ", len(data[data['feature_request'] == 1]))

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

data_labels = data[["rating", "bug_report", "feature_request", "user_experience"]]

data_labels.sum(axis=0).plot.bar()

# join all review texts into a single string
raw_text = " ".join(data['data'].tolist())


As we can see, the rating label is much more frequent than the other three labels. This imbalance might distort the results if we are not careful. It also appears that more than one label can apply to the same review, which makes this a multilabel problem:

![Types of classification problems (https://towardsdatascience.com/multilabel-text-classification-done-right-using-scikit-learn-and-stacked-generalization-f5df2defc3b5#6de1)](./data/Types_of_classification_problems.png "Types of classification problems")
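
One possible way to apply the stratification mentioned earlier in a multilabel setting is to stratify on the combination of labels rather than on a single label. The sketch below uses hypothetical toy rows and scikit-learn's `train_test_split`; it is only an illustration, not the split used later in the project. It assumes every label combination occurs at least twice, which `stratify` requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

label_cols = ["rating", "bug_report", "feature_request", "user_experience"]

# Hypothetical toy label matrix: 8 rating-only, 4 rating+bug, 4 feature, 4 user experience
rows = ([(1, 0, 0, 0)] * 8 + [(1, 1, 0, 0)] * 4
        + [(0, 0, 1, 0)] * 4 + [(0, 0, 0, 1)] * 4)
toy = pd.DataFrame(rows, columns=label_cols)

# Encode each label combination as a string such as "1100" and stratify on it
combo = toy[label_cols].astype(str).apply("".join, axis=1)

train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=combo, random_state=0
)

# Each combination keeps roughly the same share in both subsets
print(combo.loc[train_df.index].value_counts())
print(combo.loc[test_df.index].value_counts())
```

Rare combinations with a single example would need to be merged or handled separately before using this approach.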

Since this is a multilabel classification problem, we also explored how the classes overlap. Most of the reviews are still ratings only, and the largest overlap is between rating and bug report.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

label_columns = ["rating", "bug_report", "feature_request", "user_experience"]

# Build, for each review, the list of labels whose column is set to 1
# (a review without any label gets an empty list instead of [''])
tags = []
for index, row in data.iterrows():
    tags.append([label for label in label_columns if row[label] == 1])

data = data.assign(tags=tags)

# Binary indicator matrix with one column per label
mlb = MultiLabelBinarizer()
tags1 = mlb.fit_transform(data["tags"])

# Lists are not hashable, so join each tag list into a string before counting combinations
ax = data["tags"].apply(",".join).value_counts().sort_index().plot.bar()
ax.set_xlabel("Tag combination")
ax.set_ylabel("Number of observations")
ax.set_title("Tag Distribution of All Observations")
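
The overlap between classes can also be quantified with a label co-occurrence matrix. This is a minimal sketch on hypothetical toy rows: multiplying the transposed label matrix by itself gives, on the diagonal, the number of reviews per label and, off the diagonal, the number of reviews that carry both labels.

```python
import pandas as pd

label_cols = ["rating", "bug_report", "feature_request", "user_experience"]

# Hypothetical toy label matrix (one row per review)
toy = pd.DataFrame({
    "rating":          [1, 1, 1, 0, 0, 1],
    "bug_report":      [0, 1, 1, 1, 0, 0],
    "feature_request": [0, 0, 0, 0, 1, 1],
    "user_experience": [0, 0, 1, 0, 0, 0],
})

# Co-occurrence counts: entry (a, b) is the number of reviews labeled with both a and b
cooc = toy[label_cols].T.dot(toy[label_cols])
print(cooc)

# Number of labels assigned to each review
print(toy[label_cols].sum(axis=1).value_counts().sort_index())
```

On the real data, `data[label_cols]` would take the place of `toy[label_cols]`.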

## Word distribution (TF-IDF)


We calculated the most frequent tokens in the reviews and built a word cloud to help visualize them.

In [None]:
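# tokenize the text
from nltk.tokenize import word_tokenize

def flatten(l):
    return [item for sublist in l for item in sublist]

# The reviews are in French, so use the French tokenizer model
# (requires the NLTK Punkt tokenizer data to be downloaded beforehand)
tokenized_text = flatten(
    data['data'].apply(lambda text: word_tokenize(text, language="french")).tolist()
)

# Print only a sample, since the full token list is very long
print(tokenized_text[:50])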

In [None]:
from collections import Counter

# Count how often each token occurs in the tokenized reviews.
# Counting with raw_text.count(token) would also match substrings inside longer words.
token_counter = Counter(tokenized_text)
print(token_counter.most_common(100))

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud().generate(" ".join(tokenized_text))

plt.figure()
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

We created a box plot to check whether there is a relationship between the score and the type of review. Different types of reviews received different scores, so we concluded that the score carries useful information.

In [None]:
score_columns = ['rating', 'bug_report', 'feature_request', 'user_experience']

# Create a box plot to show the distribution of scores for each type of comment
fig, ax = plt.subplots(figsize=(10, 8))
ax.boxplot([data[data[col]==1]['score'] for col in score_columns], labels=score_columns)
ax.set_xlabel('Type of Comment')
ax.set_ylabel('Score')
ax.set_title('Score by Type of Comment')

# Show the plot
plt.show()
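
The box plot can be complemented with a numeric summary of the score for each type of review. The following sketch uses hypothetical toy rows with the same column names as the dataset; the values are made up for illustration and say nothing about the real distribution.

```python
import pandas as pd

label_cols = ["rating", "bug_report", "feature_request", "user_experience"]

# Hypothetical toy data: one score per review plus the four binary labels
toy = pd.DataFrame({
    "score":           [5, 4, 1, 2, 3, 1],
    "rating":          [1, 1, 0, 0, 1, 0],
    "bug_report":      [0, 0, 1, 1, 0, 1],
    "feature_request": [0, 0, 0, 1, 1, 0],
    "user_experience": [1, 0, 0, 0, 0, 0],
})

# For every label, summarize the scores of the reviews that carry it
summary = pd.DataFrame({
    col: toy.loc[toy[col] == 1, "score"].agg(["count", "mean", "median"])
    for col in label_cols
}).T
print(summary)
```

A clear gap between the medians of two labels would support using the score as an additional feature.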