**1. ETL (ORIGINAL DATASET)**

# 1.1 Import the required libraries

The following libraries are common in Python programming environments, especially in the context of data analysis and data science. 


**Pandas (import pandas as pd):**

Pandas is a Python library that provides flexible data structures and data analysis tools. 
Importing it as pd is a common convention that shortens the library name and makes the code more concise. 
Pandas is widely used to manipulate and analyze tabular datasets.


**JSON (import json):**

JSON (JavaScript Object Notation) is a lightweight, human-readable data interchange format.
The json library in Python allows for the serialization and deserialization of data in JSON format.
It can be used to read and write data in this format.
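
As a brief illustration of serialization and deserialization, the sketch below round-trips a small record through `json.dumps` and `json.loads`. The record is made up for the example.

```python
import json

# Hypothetical record, used only for illustration.
record = {"user_id": "u1", "recommend": True, "items": 3}

# Serialize the dictionary to a JSON string (Python True becomes JSON true).
text = json.dumps(record)

# Deserialize the string back into a Python dictionary.
restored = json.loads(text)
```

The restored dictionary compares equal to the original, which is what makes JSON convenient for exchanging simple structures.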


**AST (import ast):**

AST (Abstract Syntax Tree) is a hierarchical representation of the syntactic structure of a Python program.
The ast library allows you to analyze and manipulate the abstract syntax tree of a Python source code.
It can be useful for performing static analysis of the code.
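
A point that matters for this notebook: `ast.literal_eval` can parse text written as Python literals, which `json.loads` rejects when strings use single quotes. A minimal sketch with a made-up line:

```python
import ast
import json

# Made-up line written as a Python literal (single quotes), not strict JSON.
line = "{'user_id': 'u1', 'reviews': [{'funny': '', 'recommend': True}]}"

# json.loads rejects it: JSON requires double-quoted strings.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval accepts Python literals only (no function calls or names),
# so it is a safer alternative to eval() for this kind of text.
row = ast.literal_eval(line)
```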


**Regular Expressions (import re)**

The 're' module provides regular expression operations in Python.
Regular expressions are search patterns used to match text strings.
They are powerful tools for manipulating and searching specific patterns within text strings.
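
For example, a pattern can pull the parts of a "Month day, year" date out of a longer string. The pattern and the sample string below are illustrative assumptions.

```python
import re

# Made-up text shaped like the 'posted' field of a review.
text = "Posted March 9, 2015."

# Month name, day (1-2 digits), comma, four-digit year; each part is captured.
pattern = r"(\w+)\s(\d{1,2}),\s(\d{4})"

match = re.search(pattern, text)
if match:
    month, day, year = match.groups()
```

Note that the captured groups are strings; converting them to numbers or dates is a separate step.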


**%load_ext autoreload & %autoreload 2:**

These commands are specific to Jupyter notebooks and automatically reload modules before a cell is executed. '%load_ext autoreload' loads the autoreload extension, and '%autoreload 2' reloads all imported modules (except those excluded with '%aimport') each time code is run, so edits to functions in local modules are picked up without restarting the kernel.


**Warnings (import warnings):**

The warnings module provides tools for controlling the warnings emitted by Python.
In this case, it's being configured to ignore warnings, which can be useful to prevent warnings from filling up the console output and distracting during code execution.
In summary, these imports are common in data analysis and data science environments in Python, providing tools for manipulating data, working with JSON, matching text with regular expressions, and handling warnings.
Additionally, the '%load_ext autoreload' and '%autoreload 2' commands are specific to Jupyter notebooks and are used to facilitate interactive development.
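
Silencing every warning for the whole session can also hide deprecation notices that matter. As a hedged alternative, not what this notebook does, the standard `warnings.catch_warnings` context manager limits the filter to a single block. `noisy_step` is a hypothetical function used only for this illustration.

```python
import warnings

def noisy_step():
    # Hypothetical function that emits a warning and then returns a value.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# A second block with record=True collects warnings in a list, confirming
# that the function still emits one when nothing suppresses it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()
```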


**TextBlob (from textblob import TextBlob)**

TextBlob is a Python library that provides tools for natural language processing (NLP).
It enables tasks such as sentiment analysis, extraction of key phrases, part-of-speech tagging, etc.


**Nltk (import nltk):**

The Natural Language Toolkit (nltk) library is another powerful tool for natural language processing in Python.
It provides a range of modules and resources for tasks such as tokenization, syntactic analysis, stemming, among others.
In this case, you are importing the entire nltk module.


**CSV (import csv)**

The 'csv' module in Python provides functionality for working with CSV (Comma-Separated Values) files.
CSV files are a common format for storing tabular data, where each row in the file represents a data entry and values are separated by commas or another delimiter.

The 'csv' module in Python provides functions for reading data from CSV files and writing data to CSV files.
Some important functions include 'csv.reader()' for reading a CSV file and 'csv.writer()' for writing to a CSV file.
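
A minimal sketch of both functions, using an in-memory buffer instead of a file on disk so it can be run anywhere. The rows are made up.

```python
import csv
import io

# Made-up rows; the first row is the header.
rows = [["user_id", "sentiment"], ["u1", "2"], ["u2", "1"]]

# Write to an in-memory buffer; newline="" lets csv control line endings.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Rewind and read the same data back; every value comes back as a string.
buffer.seek(0)
read_back = list(csv.reader(buffer))
```

With a real file, the same `newline=""` argument is passed to `open()`.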




In [9]:

'''Necessary libraries.'''
import pandas as pd                 # Pandas for tabular data manipulation.
import json                         # Module for working with JSON.
import ast                          # Module for evaluating Python literal expressions.
import re                           # Module for working with regular expressions.
from textblob import TextBlob       # I import TextBlob from the textblob library.
import nltk                         # Natural Language Toolkit.
import csv                          # I import the CSV module into Python.

'''Enable auto-reload of modules before executing a cell'''
%load_ext autoreload
%autoreload 2

'''Import the warning module and set it to ignore all warnings'''
import warnings
warnings.filterwarnings("ignore")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 1.2 Auxiliary Functions

**1.2.1 Check Data Type**

The 'check_data_type' function is used to inspect the data types contained in the columns of a DataFrame such as "df_games".

It takes a DataFrame as input and returns a new DataFrame that reports, for each column of the input DataFrame, the data types found, the percentage of non-null and null values, and the number of null values.

This summary is especially useful for conducting an initial data quality analysis on a dataset.
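
The per-column measurements behind such a summary can be tried on a toy column first. This is a sketch with made-up values; `dtype=object` is an assumption made here so that the mix of Python types in the column is preserved.

```python
import pandas as pd

# Toy column mixing strings, a float, and a missing value (made-up data).
prices = pd.Series(["Free", 4.99, None, "9.99"], dtype=object, name="price")

# count() skips missing values, so this is the share of non-null entries.
non_null_pct = round(prices.count() / len(prices) * 100, 2)

# Number of missing values.
nulls = int(prices.isnull().sum())

# Python types actually present in the column; more informative than the
# pandas dtype, which is simply "object" for a mixed column like this one.
types_found = set(prices.apply(type).unique())
```

Checking the Python types element by element is what reveals columns that mix, for instance, strings and floats.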

In [10]:
def check_data_type(df):
    
    '''A dictionary (my_dict) is created with five keys: "field_name", "data_type", "non_null_%", "null_%" and "nulls". 
    These keys will be used to store information about each column of the DataFrame.'''
    
    my_dict = {"field_name" : [], "data_type" : [], "non_null_%" : [], "null_%" : [], "nulls" : []}
    
    
    '''A loop is performed over all the columns of the DataFrame df'''
    for column in df.columns:
        percentage_non_nulls = (df[column].count() / len(df) * 100)     # The percentage of non-null values in the current column is calculated 
        my_dict['field_name'].append(column)                            # The current column is added to the list under the key 'field_name'        
        my_dict['data_type'].append(df[column].apply(type).unique())    # Obtain the unique data types in the current column and add them to the list under the key 'data_type'.
        my_dict['non_null_%'].append(round(percentage_non_nulls, 2))    # Add the percentage of non-null values to the list under the key 'non_null_%'.
        my_dict['null_%'].append(round(100 - percentage_non_nulls, 2))  # Add the percentage of null values to the list under the key 'null_%'.
        my_dict['nulls'].append(df[column].isnull().sum())              # Add the number of null values in the current column to the list under the key 'nulls'.
    
    '''Once every column has been processed, the dictionary my_dict is used to create a new DataFrame called df_info.
    (This step must sit outside the loop; otherwise the function would return after the first column.)'''
    df_info = pd.DataFrame(my_dict)
    
    '''The function returns the DataFrame df_info containing information about each column,
    including the column name, data types, percentage of non-null values, percentage of null values, and the number of null values.'''
    return df_info

**1.2.2 Check duplicates by columns**

The following function provides a useful tool for identifying and sorting duplicate rows in a pandas DataFrame based on the values of a specific column.
In our case, it can be useful for data analysis when examining and handling duplicates based on a particular column.

In [11]:
def check_duplicates_by_columns(df, column):
    
    '''Duplicate rows are filtered'''
    duplicated_rows = df[df.duplicated(subset=column, keep=False)]
    if duplicated_rows.empty:
        return 'There are no duplicates'
    
    '''The duplicate rows are sorted for comparison'''
    duplicated_rows_sorted = duplicated_rows.sort_values(by=column)
    return duplicated_rows_sorted
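
The key detail is `keep=False`, which flags every member of a duplicate group rather than only the repeated occurrences. A small sketch with made-up data:

```python
import pandas as pd

# Made-up data: user "a" appears twice.
df = pd.DataFrame({"user_id": ["a", "b", "a", "c"], "items": [10, 20, 30, 40]})

# keep=False marks all rows that share a user_id with another row.
all_members = df.duplicated(subset="user_id", keep=False)

# The default, keep="first", leaves the first occurrence unmarked.
repeats_only = df.duplicated(subset="user_id")

# Sorting by the column places the members of each group next to each other.
duplicated_rows = df[all_members].sort_values(by="user_id")
```

Seeing both rows of a group side by side makes it easier to decide which one to keep.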

**1.2.3 Convert date**

The following function takes a string containing a date in a specific format ('Month day, year'), attempts to extract that date and convert it to a different format ('YYYY-MM-DD'),
and returns the resulting date or an error message if the string does not match the expected format or cannot be converted.

In [12]:
def convert_date(date_string):
    
    '''Searches the date string for a pattern matching the format "Month day, year"'''
    match = re.search(r'(\w+\s\d{1,2},\s\d{4})', date_string)

    if match:
        '''If there is a match, it extracts the date string'''
        date_str = match.group(1)
        try:
            '''It tries to convert the date string to a pandas Timestamp'''
            date_dt = pd.to_datetime(date_str)
            '''It formats the resulting date into the "YYYY-MM-DD" format and returns it (%Y is the four-digit year)'''
            return date_dt.strftime('%Y-%m-%d')
        except (ValueError, TypeError):
            '''In case of an error during conversion, it returns "Invalid date".'''
            return 'Invalid date'
    
    else:
        '''If there is no match, it returns "Invalid format".'''
        return 'Invalid format'
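
For comparison, the same extract-then-convert idea can be sketched with the standard library only. `convert_date_strict` is a hypothetical name, and the sketch assumes English month names and the "Month day, year" layout; it is an alternative for illustration, not the function used in this notebook.

```python
import re
from datetime import datetime

def convert_date_strict(date_string):
    # Look for 'Month day, year', e.g. 'March 9, 2015'.
    match = re.search(r"([A-Za-z]+ \d{1,2}, \d{4})", date_string)
    if not match:
        return "Invalid format"
    try:
        # %B parses the full month name (locale dependent; English assumed).
        parsed = datetime.strptime(match.group(1), "%B %d, %Y")
    except ValueError:
        # Reached for impossible dates or unknown month names.
        return "Invalid date"
    return parsed.strftime("%Y-%m-%d")

results = [
    convert_date_strict("Posted March 9, 2015."),      # complete date
    convert_date_strict("Posted February 3."),         # no year in the string
    convert_date_strict("Posted February 30, 2015."),  # impossible day
]
```

The three calls exercise the three possible outcomes: a converted date, a string without a year, and a date that matches the layout but does not exist.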

**1.2.4 Sentiment analysis**

This function provides a basic way to categorize the sentiment of a text into positive, negative, or neutral based on the polarity calculated by TextBlob.

In [13]:
'''Definition of the sentiment analysis function.'''
def sentiment_analysis(review):
    
    if review is None:                      # Checks if the review is None. If so, it returns 1, which is interpreted as neutral.
        return 1
    
    analysis = TextBlob(review)             # Creates an instance of the TextBlob class with the provided review.
    polarity = analysis.sentiment.polarity  # Gets the sentiment polarity from the TextBlob analysis.
    
    if polarity < -0.2:                     # Compares the polarity with thresholds to determine the overall sentiment.
        return 0                            # If the polarity is less than -0.2, it's considered a negative sentiment and returns 0.
    
    elif polarity > 0.2:                    # If the polarity is greater than 0.2, it's considered a positive sentiment and returns 2.
        return 2
    
    else:
        return 1                            # In all other cases, it returns 1, which is interpreted as a neutral sentiment.
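
The thresholding step can be looked at separately from TextBlob. In the sketch below, `polarity_to_label` is a hypothetical helper that takes a polarity in the range [-1, 1] (the range TextBlob reports) and applies the same 0.2 cut-offs; the sample polarities are made up.

```python
def polarity_to_label(polarity, threshold=0.2):
    # 0 = negative, 1 = neutral, 2 = positive.
    if polarity < -threshold:
        return 0
    if polarity > threshold:
        return 2
    return 1

# Made-up polarity scores, including the two boundary values.
samples = [-0.8, -0.2, 0.0, 0.2, 0.65]
labels = [polarity_to_label(p) for p in samples]
```

Because both comparisons are strict, a polarity of exactly -0.2 or 0.2 falls in the neutral category.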
    
             


**1.2.5 Analysis of example reviews by sentiment**

The function 'examples_reviews_by_sentiments' is used to analyze and present examples of reviews classified according to their sentiments.
The function takes two lists as parameters: 'reviews', which contains the reviews, and 'sentiments', which contains the sentiment values associated with each review.

The function iterates through three sentiment categories.

0 for negative  

1 for neutral  

2 for positive  


It then displays examples of reviews corresponding to each category.
For each category, it prints the category number and filters the reviews that have that sentiment value.
Then, it presents the first three examples of reviews from that category.


In [14]:
def examples_reviews_by_sentiments(reviews, sentiments):
    
    for sentiment_value in range(3):
        print(f'For the sentiment analysis category {sentiment_value}, here are some examples of reviews')
        # Keep only the reviews whose sentiment equals the current category.
        sentiment_reviews = [review for review, sentiment in zip(reviews, sentiments) if sentiment == sentiment_value]
        
        # Print at most the first three reviews of the category.
        for i, review in enumerate(sentiment_reviews[:3], start=1):
            print(f'Review {i}: {review}')
            
        print('\n')
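
An equivalent way to express the grouping, sketched here with made-up data, is to bucket the reviews by sentiment in a single pass with `collections.defaultdict` and then slice each bucket. `group_reviews_by_sentiment` is a hypothetical name, and returning a dictionary instead of printing is a choice made for this sketch.

```python
from collections import defaultdict

def group_reviews_by_sentiment(reviews, sentiments, limit=3):
    # Bucket each review under its sentiment value in one pass.
    buckets = defaultdict(list)
    for review, sentiment in zip(reviews, sentiments):
        buckets[sentiment].append(review)
    # Keep at most `limit` examples for each of the three categories.
    return {value: buckets[value][:limit] for value in range(3)}

# Made-up reviews and their sentiment values.
reviews = ["awful", "fine", "great", "superb", "meh", "loved it", "brilliant"]
sentiments = [0, 1, 2, 2, 1, 2, 2]
examples = group_reviews_by_sentiment(reviews, sentiments)
```

Returning the examples makes the result easy to inspect or test, while printing them remains a one-line loop over the dictionary.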

# 1.3 ETL - australian_user_reviews


The code below reads the dataset file line by line, parses each line with ast.literal_eval, converts the resulting rows into a pandas DataFrame, and stores it as df_reviews, which is then displayed.

In [27]:
'''Dataset path australian_user_reviews'''

path_review = 'C:\\Users\\migue\\Proyecto_Individual_1_MLOps\\PI MLOps - STEAM\\australian_user_reviews.json'


'''Each line of the dataset is read'''
rows_review = []
with open(path_review, encoding='utf-8') as f:
    # Iterate over the file object directly, one line at a time (no need to load all lines with readlines()).
    for line in f:
        # ast.literal_eval evaluates the line as a Python literal expression, producing a dictionary.
        rows_review.append(ast.literal_eval(line))
        
'''It is converted into a DataFrame'''
df_reviews = pd.DataFrame(rows_review)
df_reviews


| | user_id | user_url | reviews |
|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | [{'funny': '', 'posted': 'Posted November 5, 2... |
| 1 | js41637 | http://steamcommunity.com/id/js41637 | [{'funny': '', 'posted': 'Posted June 24, 2014... |
| 2 | evcentric | http://steamcommunity.com/id/evcentric | [{'funny': '', 'posted': 'Posted February 3.',... |
| 3 | doctr | http://steamcommunity.com/id/doctr | [{'funny': '', 'posted': 'Posted October 14, 2... |
| 4 | maplemage | http://steamcommunity.com/id/maplemage | [{'funny': '3 people found this review funny',... |
| ... | ... | ... | ... |
| 25794 | 76561198306599751 | http://steamcommunity.com/profiles/76561198306... | [{'funny': '', 'posted': 'Posted May 31.', 'la... |
| 25795 | Ghoustik | http://steamcommunity.com/id/Ghoustik | [{'funny': '', 'posted': 'Posted June 17.', 'l... |
| 25796 | 76561198310819422 | http://steamcommunity.com/profiles/76561198310... | [{'funny': '1 person found this review funny',... |
| 25797 | 76561198312638244 | http://steamcommunity.com/profiles/76561198312... | [{'funny': '', 'posted': 'Posted July 21.', 'l... |
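
The line-by-line loading step can be sketched end to end on a temporary file. Everything below is illustrative: the sample lines are made up, and recording the numbers of the lines that fail to parse is an addition of this sketch, not something the notebook does.

```python
import ast
import os
import tempfile

# Two made-up lines in Python-literal form, plus one malformed line.
sample = (
    "{'user_id': 'u1', 'reviews': [{'recommend': True}]}\n"
    "{'user_id': 'u2', 'reviews': []}\n"
    "{'user_id': 'u3', 'reviews': [\n"
)

# Write the sample to a temporary file so the reading code matches real use.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

rows, bad_lines = [], []
with open(path, encoding="utf-8") as f:
    # Iterating over the file object reads one line at a time.
    for number, line in enumerate(f, start=1):
        try:
            rows.append(ast.literal_eval(line))
        except (ValueError, SyntaxError):
            # Remember which lines could not be parsed instead of stopping.
            bad_lines.append(number)

os.remove(path)
# rows can then be passed to pandas.DataFrame(rows) to build the table.
```

Keeping track of unparsable lines is optional, but it makes it clear whether any records were lost while loading.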
