# CLEANING AND DESCRIPTIVE DATA ANALYSIS

---

In [17]:
# LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import utils as eda
import re

# Path to the raw data (a local file, despite the variable name)
URL = r'C:\Users\Francesc\Documents\GitHub\Naive-Bayes-Project-Tutorial\data\raw\total_data_raw.csv'
total_data = pd.read_csv(URL)

<small>Note: The 'utils.py' file contains **specific functions** that follow standard Exploratory Data Analysis (EDA) and Descriptive Data Analysis (DDA) practice. Functions called with the 'eda' prefix are documented in 'utils.py'.</small>

1. General information about the dataset, including its shape, column names, presence of null entries, and data types:

In [18]:
# general info and shape
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


2. Check for duplicate values:

In [19]:
# Duplicates
total_data.duplicated().sum()

0
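
With no arguments, `DataFrame.duplicated()` flags a row only when every column matches an earlier row. The sketch below shows how passing `subset` narrows the comparison to the review text. It uses a hypothetical toy frame; the rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not rows from total_data
toy = pd.DataFrame({
    "package_name": ["app.a", "app.b", "app.a"],
    "review": ["great app", "great app", "crashes often"],
    "polarity": [1, 1, 0],
})

# Full-row comparison: no two rows match on every column
full_row_dupes = int(toy.duplicated().sum())

# Compare on the review text only: the second "great app" is flagged
review_dupes = int(toy.duplicated(subset=["review"]).sum())

print(full_row_dupes, review_dupes)
```

A result of 0 from the full-row check therefore does not rule out repeated review texts under different package names.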

3. Remove variables unrelated to the analysis:

In [23]:
# Drop package_name, which is not used in the analysis
total_data.drop(['package_name'], axis=1, inplace=True)

total_data.head(3)

|   | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offlin... | 0 |
| 1 | messenger issues ever since the last update, i... | 0 |
| 2 | profile any time my wife or anybody has more t... | 0 |


4. Remove leading and trailing whitespace and convert the text to lowercase:

In [24]:
total_data["review"] = total_data["review"].str.strip().str.lower()

total_data.head(3)

|   | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offlin... | 0 |
| 1 | messenger issues ever since the last update, i... | 0 |
| 2 | profile any time my wife or anybody has more t... | 0 |


5. Remove special characters (everything except lowercase letters, digits, and whitespace):

In [29]:
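# Keep only lowercase letters, digits and whitespace
total_data['review'] = total_data['review'].apply(lambda x: re.sub(r'[^a-z0-9\s]', '', x))

# Print every cleaned review for visual inspection
for value in total_data['review']:
    print(value)

# Save the cleaned data without the index column
total_data.to_csv('clean_total_data.csv', index=False)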

privacy at least put some option appear offline i mean for some people like me its a big pressure to be seen online like you need to response on every message or else you be called seenzone only if only i wanna do on facebook is to read on my newsfeed and just wanna response on message i want to pls reconsidered my review i tried to turn off chat but still can see me as online
messenger issues ever since the last update initial received messages dont get pushed to the messenger app and you dont get notification in the facebook app or messenger app you open the facebook app and happen to see you have a message you have to click the icon and it opens messenger subsequent messages go through messenger app unless you close the chat head then you start over with no notification and having to go through the facebook app
profile any time my wife or anybody has more than one post and i view them it would take me to there profile so that i can view them all at once now when i try to view them i
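
Steps 4 and 5 can be combined into one helper, which makes the cleaning easy to try on a few strings before applying it to the whole column. This is a sketch and not part of the original notebook: `clean_review` is a hypothetical name and the sample strings are invented.

```python
import re

import pandas as pd

# Same pattern as step 5: drop everything except lowercase letters, digits and whitespace
NON_ALNUM = re.compile(r"[^a-z0-9\s]")


def clean_review(text: str) -> str:
    """Strip outer whitespace, lowercase, then remove special characters."""
    return NON_ALNUM.sub("", text.strip().lower())


# Hypothetical sample strings, not rows from the dataset
sample = pd.Series(["  Don't get NOTIFICATIONS!! ", "5 stars :)"])
cleaned = sample.apply(clean_review)
print(cleaned.tolist())
```

Because the regex runs after `strip()`, removing trailing punctuation can leave a trailing space, as in the second string. A second `strip()` after the substitution would remove it if that matters downstream.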

## DESCRIPTIVE DATA ANALYSIS CONCLUSIONS

**Dataset Size:**
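The raw dataset comprises 891 entries and 3 columns: two feature columns ('package_name' and 'review') and one target column, 'polarity'. 'package_name' was dropped during cleaning, leaving 'review' as the only predictor.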

**Data Type:**
Both feature columns are strings (pandas dtype 'object'). The target variable 'polarity' is a binary categorical variable stored as an integer ('int64'), with '0' indicating a negative outcome and '1' indicating a positive outcome.
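
Since 'polarity' is a binary label, its class balance can be inspected with `value_counts`. The sketch below uses a hypothetical toy Series in place of `total_data['polarity']`; the values are invented and say nothing about the real distribution.

```python
import pandas as pd

# Hypothetical labels standing in for total_data['polarity']
polarity = pd.Series([0, 0, 1, 0, 1], name="polarity")

# Absolute counts per class
counts = polarity.value_counts()

# Relative frequencies per class (they sum to 1)
shares = polarity.value_counts(normalize=True)

print(counts)
print(shares)
```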

**Missing Values:**
The dataset is complete: all 891 entries are non-null in every column.
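
`info()` counts only `NaN`/`None` as missing, so a review that becomes an empty string after cleaning would still register as non-null. The following sketch shows an extra check on hypothetical toy data. The notebook does not run this check, and the output above does not show whether any such rows exist in the real data.

```python
import pandas as pd

# Hypothetical cleaned reviews; two of them contain no visible text
reviews = pd.Series(["good app", "", "   ", "slow"], name="review")

# Null check, equivalent to what info() reports
n_null = int(reviews.isna().sum())

# Blank check: empty or whitespace-only strings
n_blank = int(reviews.str.strip().eq("").sum())

print(n_null, n_blank)
```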

**Duplicates:**
The dataset contains no duplicate rows ('duplicated().sum()' returned 0), so every row is distinct.
