# Nova Financial Solutions - Exploratory Data Analysis (EDA)

## Data Loading & Cleaning

In this section, we import several functions from the `shared` module to load and clean the raw dataset.

In [1]:
import os  # needed later to build the output folder path

from shared.dataset_loader import load_raw_dataset
from shared.data_cleaning import (
    date_to_datetime,
    handle_missing_values,
    standardize_text,
    handle_duplicates,
    save_processed_dataset,
)

### Load Raw Data
First, we load the raw dataset with the `load_raw_dataset` function, which reads the data from a CSV file and returns a pandas DataFrame.

In [2]:
raw_data = load_raw_dataset()
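
The source of `load_raw_dataset` is not shown in this notebook. As a rough illustration only, a loader of this kind could look like the sketch below. The name `load_raw_dataset_sketch` and its `csv_path` parameter are hypothetical and are not part of the `shared` module.

```python
from pathlib import Path

import pandas as pd


def load_raw_dataset_sketch(csv_path):
    """Hypothetical stand-in for a CSV loader; not the project's actual code."""
    path = Path(csv_path)
    # Fail early with a clear message instead of a deep pandas traceback.
    if not path.is_file():
        raise FileNotFoundError(f"Raw dataset not found: {path}")
    # Read the CSV into a DataFrame; column types are inferred by pandas.
    return pd.read_csv(path)
```

Checking that the file exists before reading makes a wrong relative path easy to spot when the notebook is run from a different working directory.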

### Convert date to datetime
The `date_to_datetime` function takes the raw dataset as input and converts its date columns to datetime format.

In [None]:
processed_data = date_to_datetime(raw_data)
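
For readers without access to the `shared` module, the conversion can be approximated with `pd.to_datetime`. The sketch below is an assumption about how such a helper might behave: the function name, the single `date` column, the UTC normalisation, and the choice to coerce unparseable values to `NaT` are all illustrative.

```python
import pandas as pd


def date_to_datetime_sketch(df, column="date"):
    """Hypothetical sketch: parse one date column into timezone-aware datetimes."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # errors="coerce" turns unparseable values into NaT instead of raising;
    # utc=True normalises any timezone offsets to UTC.
    out[column] = pd.to_datetime(out[column], errors="coerce", utc=True)
    return out
```

Coercing bad values to `NaT` lets a later missing-value step drop them, rather than stopping the whole pipeline on one malformed row.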

### Handle missing values
The `handle_missing_values` function handles missing values in the dataset. It drops rows with an empty `headline`, `url`, `date`, or `stock` value and replaces empty `publisher` values with `Unknown`.

In [None]:
# Call the handle_missing_values function on the processed dataset
processed_data = handle_missing_values(processed_data)
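
The two rules described above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's implementation: the function name is hypothetical, and treating whitespace-only strings as missing is an added assumption.

```python
import numpy as np
import pandas as pd

# Columns that must be present for a row to be kept (taken from the description).
REQUIRED_COLUMNS = ["headline", "url", "date", "stock"]


def handle_missing_values_sketch(df):
    """Hypothetical sketch of the missing-value rules described in the text."""
    out = df.copy()
    # Assumption: empty or whitespace-only strings count as missing values.
    out = out.replace(r"^\s*$", np.nan, regex=True)
    # Drop rows that lack any of the required fields.
    out = out.dropna(subset=REQUIRED_COLUMNS)
    # A missing publisher is not fatal; label it instead of dropping the row.
    out["publisher"] = out["publisher"].fillna("Unknown")
    return out.reset_index(drop=True)
```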

### Standardize text
The `standardize_text` function standardizes text data in the dataset, for example by converting text to lowercase and removing special characters and punctuation.

In [None]:
# Call the standardize_text function on the processed dataset
processed_data = standardize_text(processed_data)
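
A minimal sketch of this kind of normalisation is shown below. Which columns the real function touches is not stated, so the `columns` parameter, its `headline` default, and the extra whitespace collapsing are assumptions made for illustration.

```python
import pandas as pd


def standardize_text_sketch(df, columns=("headline",)):
    """Hypothetical sketch: lowercase text and strip punctuation in given columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.lower()                            # case-fold
            .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation and symbols
            .str.replace(r"\s+", " ", regex=True)     # collapse runs of whitespace
            .str.strip()                            # trim leading/trailing spaces
        )
    return out
```

URL-like columns are deliberately left out of the default, since stripping punctuation from a URL would break it.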

### Handle duplicates
The `handle_duplicates` function deals with duplicate records in the dataset. It checks for duplicate rows based on the `url` column and, within each group of duplicates, keeps only the row with the latest date.

In [None]:
# Call the handle_duplicates function on the processed dataset
processed_data = handle_duplicates(processed_data)
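
The "keep the latest row per URL" rule maps naturally onto a sort followed by `drop_duplicates`. The sketch below is one possible way to express it, under the assumption that `date` has already been converted to datetime; the function name is hypothetical.

```python
import pandas as pd


def handle_duplicates_sketch(df):
    """Hypothetical sketch: keep only the most recent row for each url."""
    # Sort oldest-to-newest; rows with a missing date go first so they are
    # never preferred over a row that has a real date.
    ordered = df.sort_values("date", kind="mergesort", na_position="first")
    # After sorting, the last occurrence of each url is the latest one.
    deduped = ordered.drop_duplicates(subset="url", keep="last")
    # Restore the original row order before renumbering the index.
    return deduped.sort_index().reset_index(drop=True)
```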

### Save processed dataset
The `save_processed_dataset` function saves the cleaned and preprocessed dataset to a CSV file. It creates the output folder if it does not exist and then writes the cleaned dataset to it.

In [None]:
# Save the processed dataset
output_folder = os.path.join('..', 'data', 'processed') 
save_processed_dataset(processed_data, output_folder)
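
The behaviour described above (create the folder, then write a CSV) could be sketched as follows. The function name, the `filename` parameter, and its default value are hypothetical; the actual output file name used by the `shared` module is not given here.

```python
import os

import pandas as pd


def save_processed_dataset_sketch(df, output_folder, filename="processed_data.csv"):
    """Hypothetical sketch: write a DataFrame to CSV, creating the folder if needed."""
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(output_folder, exist_ok=True)
    output_path = os.path.join(output_folder, filename)
    # index=False avoids writing the positional index as an extra column.
    df.to_csv(output_path, index=False)
    return output_path
```

Returning the path is a small convenience so the caller can log or reload the file without rebuilding the name.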

## Descriptive Statistical Data Analysis

In this section, we define functions that perform descriptive statistical analysis on a given dataset.


### Text length stats

The `get_text_length_stats` function calculates statistics on the length, in characters, of the headlines in the given dataset. It takes the dataset as input and returns a dictionary with the following statistics:

- `mean`: The mean length of the headlines.
- `median`: The median length of the headlines.
- `minimum`: The minimum length of the headlines.
- `maximum`: The maximum length of the headlines.
- `standard_deviation`: The standard deviation of the lengths of the headlines.


In [None]:
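def get_text_length_stats(data):
    """Return summary statistics for headline length, measured in characters."""
    # Character count of each headline; missing headlines yield NaN and are
    # skipped by the aggregations below.
    headline_lengths = data['headline'].str.len()
    stats = {
        'mean': headline_lengths.mean(),
        'median': headline_lengths.median(),
        'minimum': headline_lengths.min(),
        'maximum': headline_lengths.max(),
        'standard_deviation': headline_lengths.std()
    }
    return stats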
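
As a quick cross-check of these statistics, the same quantities can be computed on a tiny hand-made sample whose answers are easy to verify by eye. The sample data and variable names below are purely illustrative. Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so it differs from the population value.

```python
import pandas as pd

# Three toy headlines with lengths 3, 5 and 7 characters.
sample = pd.DataFrame({"headline": ["abc", "abcde", "abcdefg"]})
lengths = sample["headline"].str.len()

# One call computes all five statistics reported by the function above.
summary = lengths.agg(["mean", "median", "min", "max", "std"])

# Population standard deviation, for comparison with the default sample one.
population_std = lengths.std(ddof=0)

print(summary)
print("population std:", population_std)
```

For headline counts in the thousands the two standard deviations are nearly identical, but the distinction matters when comparing against tools that default to `ddof=0`, such as NumPy's `np.std`.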

## Inferential Statistical Data Analysis

In this section, we define functions that perform inferential statistical analysis on a given dataset.

### Function: `get_text_length_stats(data)`