# Nova Financial Solutions - Exploratory Data Analysis (EDA)

## Data Cleaning and Processing

### Import Functions
Import the functions needed to load and clean the raw dataset from the `shared` package.

In [None]:
from shared.dataset_loader import load_raw_dataset
from shared.data_cleaning import (
    date_to_datetime,
    handle_missing_values,
    standardize_text,
    handle_duplicates,
    save_processed_dataset,
)

### Load Raw Dataset
First, we need to load the raw dataset using the `load_raw_dataset` function. This function loads the data from a CSV file and returns a Pandas DataFrame.

In [None]:
raw_data = load_raw_dataset()

### Convert date to datetime
The `date_to_datetime` function takes the raw dataset as input and converts its date columns to datetime format.

In [None]:
processed_data = date_to_datetime(raw_data)
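
The implementation of `date_to_datetime` lives in the `shared` package and is not shown in this notebook. As a rough illustration only, a conversion of this kind could look like the sketch below. The function name `date_to_datetime_sketch`, the `date` column default, and the choice to coerce unparseable values to `NaT` are assumptions, not the project's actual code.

```python
import pandas as pd


def date_to_datetime_sketch(df: pd.DataFrame, column: str = "date") -> pd.DataFrame:
    """Hypothetical stand-in: return a copy of df with `column` parsed as datetimes."""
    out = df.copy()
    # errors="coerce" turns unparseable values into NaT instead of raising;
    # utc=True converts timestamps with UTC offsets to a single timezone (UTC).
    out[column] = pd.to_datetime(out[column], errors="coerce", utc=True)
    return out


# Small made-up sample: two valid timestamps and one invalid value.
sample = pd.DataFrame(
    {"date": ["2020-06-05 10:30:54-04:00", "2020-05-22 09:00:00-04:00", "not a date"]}
)
converted = date_to_datetime_sketch(sample)
print(converted["date"])
```

Coercing bad values to `NaT` fits the next step, where rows with a missing date are dropped.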

### Handle missing values
The `handle_missing_values` function handles missing values in the dataset. It drops rows with an empty `headline`, `url`, `date`, or `stock` value and replaces empty `publisher` values with `Unknown`.

In [None]:
# Call the handle_missing_values function on the processed dataset
processed_data = handle_missing_values(processed_data)
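
The two rules described above can be sketched as follows. This is a minimal illustration, not the code in `shared.data_cleaning`: the function name `handle_missing_values_sketch` is hypothetical, and treating whitespace-only strings as missing is an assumption.

```python
import pandas as pd

# Columns that must be present for a row to be kept (taken from the description above).
REQUIRED_COLUMNS = ["headline", "url", "date", "stock"]


def handle_missing_values_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for handle_missing_values."""
    out = df.copy()
    # Assumption: empty or whitespace-only strings count as missing values.
    for col in REQUIRED_COLUMNS + ["publisher"]:
        is_blank = out[col].astype(str).str.strip() == ""
        out[col] = out[col].mask(is_blank)
    # Drop rows that lack any required field.
    out = out.dropna(subset=REQUIRED_COLUMNS)
    # Keep rows without a publisher, but label them explicitly.
    out["publisher"] = out["publisher"].fillna("Unknown")
    return out.reset_index(drop=True)


# Made-up sample: the second row has no headline, the third has no publisher.
sample = pd.DataFrame(
    {
        "headline": ["Stocks rally", "", "Earnings beat"],
        "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
        "publisher": ["Jane Doe", "John Roe", None],
        "date": ["2020-06-05", "2020-06-06", "2020-06-07"],
        "stock": ["A", "B", "C"],
    }
)
cleaned = handle_missing_values_sketch(sample)
print(cleaned)
```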

### Standardize text
The `standardize_text` function standardizes text data in the dataset. It performs tasks such as converting text to lowercase and removing special characters and punctuation.

In [None]:
# Call the standardize_text function on the processed dataset
processed_data = standardize_text(processed_data)
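
A minimal sketch of this kind of normalization is shown below. The function name `standardize_text_sketch`, the default of cleaning only the `headline` column, and the extra whitespace collapsing are assumptions. The real function may treat other columns differently; stripping punctuation from `url`, for example, would break the links.

```python
import pandas as pd


def standardize_text_sketch(df: pd.DataFrame, columns=("headline",)) -> pd.DataFrame:
    """Hypothetical stand-in: lowercase text and strip punctuation in the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .astype(str)                               # assumes missing values were handled earlier
            .str.lower()                               # convert to lowercase
            .str.replace(r"[^\w\s]", "", regex=True)   # drop punctuation and special characters
            .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
            .str.strip()                               # trim leading and trailing spaces
        )
    return out


sample = pd.DataFrame({"headline": ["  Stocks RALLY, as Fed holds rates!  "]})
standardized = standardize_text_sketch(sample)
print(standardized["headline"].iloc[0])
```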

### Handle duplicates
The `handle_duplicates` function deals with duplicate records in the dataset. It checks for duplicate rows based on the `url` column and, within each group of duplicates, keeps only the row with the latest date.

In [None]:
# Call the handle_duplicates function on the processed dataset
processed_data = handle_duplicates(processed_data)
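
One common way to keep the most recent row per URL is to sort by date and then drop duplicates while keeping the last occurrence. The sketch below illustrates that idea; the function name `handle_duplicates_sketch` is hypothetical, and the actual implementation in `shared.data_cleaning` may differ.

```python
import pandas as pd


def handle_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: keep only the latest-dated row for each url."""
    return (
        df.sort_values("date")                        # oldest first, newest last
        .drop_duplicates(subset="url", keep="last")   # keep the newest row per url
        .reset_index(drop=True)
    )


# Made-up sample: the first two rows share a url but have different dates.
sample = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
        "date": pd.to_datetime(["2020-06-01", "2020-06-03", "2020-06-02"]),
        "headline": ["old", "new", "other"],
    }
)
deduped = handle_duplicates_sketch(sample)
print(deduped)
```

Note that this approach assumes the `date` column has already been converted to datetime, so that sorting is chronological rather than lexicographic.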

### Save processed dataset
The `save_processed_dataset` function saves the cleaned and preprocessed dataset to a CSV file. It creates the output folder if it doesn't exist and then writes the cleaned dataset to it.

In [None]:
import os

# Save the processed dataset
output_folder = os.path.join('..', 'data', 'processed')
save_processed_dataset(processed_data, output_folder)
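
For reference, the "create the folder, then write the CSV" behavior can be sketched as below. The function name `save_processed_dataset_sketch` and the file name `cleaned_data.csv` are hypothetical; the notebook does not show which file name the real function uses.

```python
import os
import tempfile

import pandas as pd


def save_processed_dataset_sketch(
    df: pd.DataFrame, output_folder: str, filename: str = "cleaned_data.csv"
) -> str:
    """Hypothetical stand-in: write df as CSV into output_folder and return the path."""
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(output_folder, exist_ok=True)
    path = os.path.join(output_folder, filename)
    # index=False avoids writing the DataFrame index as an extra column.
    df.to_csv(path, index=False)
    return path


# Demonstrate in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "processed")
    saved_path = save_processed_dataset_sketch(pd.DataFrame({"a": [1, 2]}), target)
    saved_exists = os.path.exists(saved_path)
    reloaded = pd.read_csv(saved_path)
print(saved_exists, reloaded.shape)
```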

## Data Analysis using the Cleaned and Processed Dataset

### Import Functions
Import the functions needed to load the cleaned dataset and perform the analysis from the `shared` package.

In [1]:
from shared.dataset_loader import load_cleaned_dataset
from shared.descriptive_analysis import (
    get_text_length_stats,
    count_articles_per_publisher,
    identify_unique_domains,
)
from shared.inferential_analysis import (
    analyze_publication_dates,
    perform_sentiment_analysis,
)

### Load Cleaned Dataset
First, we need to load the cleaned dataset using the `load_cleaned_dataset` function. This function loads the data from a CSV file and returns a Pandas DataFrame.

In [2]:
cleaned_data = load_cleaned_dataset()

## Descriptive Statistical Analysis
In this section, we call functions that perform descriptive statistical analysis on the cleaned dataset.


### Text length stats

This function calculates various statistics related to the length of the headlines in the given dataset. It takes the dataset as input and returns a dictionary with the following statistics:

- `mean`: The mean length of the headlines.
- `median`: The median length of the headlines.
- `minimum`: The minimum length of the headlines.
- `maximum`: The maximum length of the headlines.
- `standard_deviation`: The standard deviation of the lengths of the headlines.


In [3]:
from IPython.display import display
stats = get_text_length_stats(cleaned_data)
print("\nNews Stats:")
display(stats)


News Stats:


{'mean': 73.12051490484095,
 'median': 64.0,
 'minimum': 3,
 'maximum': 512,
 'standard_deviation': 40.73530993195065}
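
A dictionary with these keys could be produced along the following lines. This is a sketch only: the function name `get_text_length_stats_sketch` is hypothetical, and the use of the sample standard deviation (pandas' default, `ddof=1`) is an assumption about how the real function computes it.

```python
import pandas as pd


def get_text_length_stats_sketch(df: pd.DataFrame, column: str = "headline") -> dict:
    """Hypothetical stand-in: summary statistics of string lengths in `column`."""
    lengths = df[column].astype(str).str.len()  # number of characters per headline
    return {
        "mean": float(lengths.mean()),
        "median": float(lengths.median()),
        "minimum": int(lengths.min()),
        "maximum": int(lengths.max()),
        # Series.std() uses ddof=1 (sample standard deviation) by default.
        "standard_deviation": float(lengths.std()),
    }


# Made-up sample with headline lengths 3, 5 and 7.
sample = pd.DataFrame({"headline": ["abc", "abcde", "abcdefg"]})
sample_stats = get_text_length_stats_sketch(sample)
print(sample_stats)
```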

### Count number of articles per publisher

In [4]:
publisher_counts = count_articles_per_publisher(cleaned_data)
# Print the results
print("Number of articles per publisher:")
print(publisher_counts)

Number of articles per publisher:
publisher
Paul Quintaro                      228373
Lisa Levin                         186979
Benzinga Newsdesk                  150484
Charles Gross                       96732
Monica Gerson                       82380
                                    ...  
Shazir Mucklai - Imperium Group         1
Laura Jennings                          1
Eric Martin                             1
Jose Rodrigo                            1
Jeremie Capron                          1
Name: count, Length: 1034, dtype: int64
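
The shape of this output (a `publisher` index, values sorted in descending order, `Name: count`) is what pandas' `Series.value_counts` produces, so a plausible sketch of the counting step is shown below. The function name `count_articles_per_publisher_sketch` is hypothetical, and the real function may do more than this.

```python
import pandas as pd


def count_articles_per_publisher_sketch(df: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in: number of rows per publisher, most frequent first."""
    return df["publisher"].value_counts()


# Made-up sample with three articles from two publishers.
sample = pd.DataFrame({"publisher": ["Jane Doe", "John Roe", "Jane Doe"]})
sample_counts = count_articles_per_publisher_sketch(sample)
print(sample_counts)
```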


### Identify unique domains

In [5]:
unique_domains = identify_unique_domains(cleaned_data)
# Print the results
print("Unique Domains:")
print(unique_domains)

Unique Domains:
['' 'benzinga' 'gmail' 'andyswan' 'investdiva' 'eosdetroit' 'tothetick'
 'forextraininggroup' 'stockmetrix']
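
The printed values look like the part of an email address between `@` and the first dot, with the empty string presumably standing for publishers that are not email addresses. Under that assumption, which is inferred from the output rather than from the source of `identify_unique_domains`, the extraction could be sketched as follows. The function name and the regular expression are hypothetical.

```python
import pandas as pd


def identify_unique_domains_sketch(df: pd.DataFrame):
    """Hypothetical stand-in: unique email domain names found in the publisher column."""
    domains = (
        df["publisher"]
        .astype(str)
        # Capture the text right after "@", up to the first character that is
        # not a letter, digit, underscore or hyphen (so the ".com" part is left out).
        .str.extract(r"@([\w-]+)", expand=False)
        # Publishers without an email address yield NaN; represent them as "".
        .fillna("")
    )
    return domains.unique()


# Made-up sample: one plain name and three email-style publishers.
sample = pd.DataFrame(
    {
        "publisher": [
            "Jane Doe",
            "analyst@benzinga.com",
            "someone@gmail.com",
            "editor@benzinga.com",
        ]
    }
)
sample_domains = identify_unique_domains_sketch(sample)
print(sample_domains)
```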


## Inferential Statistical Data Analysis
In this section, we call functions that perform inferential statistical analysis on the cleaned dataset.

### Analyze the publication dates
Analyze the publication dates in the dataset to identify trends over time.

In [None]:
analyze_publication_dates(cleaned_data)
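
What `analyze_publication_dates` computes is not shown in this notebook. As one possible illustration of looking for trends over time, the sketch below counts articles per calendar day and per day of the week. The function name `analyze_publication_dates_sketch` and both aggregations are assumptions, not the project's actual analysis.

```python
import pandas as pd


def analyze_publication_dates_sketch(df: pd.DataFrame):
    """Hypothetical stand-in: article counts per calendar day and per weekday."""
    dates = pd.to_datetime(df["date"], utc=True)
    # Articles per calendar day, in chronological order.
    per_day = dates.dt.date.value_counts().sort_index()
    # Articles per day of the week (Monday, Tuesday, ...).
    per_weekday = dates.dt.day_name().value_counts()
    return per_day, per_weekday


# Made-up sample: two articles on Monday 2020-06-01, one on Tuesday 2020-06-02.
sample = pd.DataFrame(
    {"date": ["2020-06-01 10:00:00", "2020-06-01 15:00:00", "2020-06-02 09:00:00"]}
)
per_day, per_weekday = analyze_publication_dates_sketch(sample)
print(per_day)
print(per_weekday)
```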

### Perform sentiment analysis on headlines

In [None]:
perform_sentiment_analysis(cleaned_data)