# **Data Cleaning & Initial Checks**

---

### **Setup & Load the Data**
---

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import pytz
from IPython.display import display
import matplotlib.pyplot as plt
import sys
import os
import seaborn as sns

In [2]:
# Load a CSV file from your local file system
df = pd.read_csv('C:/Users/Admin/OneDrive/10 Academy/Week 1/Technical Content/Data/raw_analyst_ratings.csv')
df.head()

| | Unnamed: 0 | headline | url | publisher | date | stock |
|---|---|---|---|---|---|---|
| 0 | 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A |
| 1 | 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A |
| 2 | 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A |
| 3 | 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A |
| 4 | 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A |
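
The `date` values in the preview carry a `-04:00` offset, while the cleaning step parses them with `utc=True`. The following is a minimal sketch of what that conversion does, using two timestamps copied from the preview; the variable names `raw`, `parsed`, and `eastern` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two timestamps copied from the preview (offset -04:00).
raw = pd.Series([
    "2020-06-05 10:30:54-04:00",
    "2020-05-22 11:38:59-04:00",
])

# utc=True converts every offset-aware value to UTC.
parsed = pd.to_datetime(raw, utc=True)
print(parsed.dt.tz)
print(parsed.iloc[0])  # 10:30:54-04:00 becomes 14:30:54+00:00

# Converting back to a local zone changes the display only, not the instant.
eastern = parsed.dt.tz_convert("America/New_York")
print(eastern.iloc[0])
```

Storing timestamps in UTC and converting to a local zone only for display keeps comparisons and filters unambiguous.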


---
### **Data Cleaning**

In [3]:
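# Parse 'date' as timezone-aware UTC datetimes; unparseable values become NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

# Keep rows dated 2020 onwards. Rows with NaT dates fail the comparison and are
# dropped here as well. .copy() avoids SettingWithCopyWarning on later in-place edits.
df = df[df['date'] >= '2020-01-01'].copy()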

# Check the structure, data types, and missing values
print("DataFrame Info:")
df.info()
print("\nMissing Values:\n", df.isnull().sum())

# Preview the 'date' column (it was already parsed to UTC at the top of this cell)
print("\nSample Dates Before Parsing:")
print(df['date'].head())

# Re-applying to_datetime leaves the values unchanged because the column is already datetime64[ns, UTC]
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

print("\nSample Dates After Parsing:")
print(df['date'].head())

# Check for any parsing failures (NaT)
print("\nUnparsed Dates (NaT):", df['date'].isna().sum())

# Remove the 'Unnamed: 0' column
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)

# Remove rows with NaT in 'date' column
df.dropna(subset=['date'], inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Save the cleaned dataframe to a local CSV file
df.to_csv('C:/Users/Admin/OneDrive/10 Academy/Week 1/Technical Content/Data/raw_analyst_ratings_cleaned.csv', index=False)

# Display the first few rows of the cleaned DataFrame
print("\nCleaned DataFrame Preview:")
display(df.head())

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 28392 entries, 0 to 1406315
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   Unnamed: 0  28392 non-null  int64              
 1   headline    28392 non-null  object             
 2   url         28392 non-null  object             
 3   publisher   28392 non-null  object             
 4   date        28392 non-null  datetime64[ns, UTC]
 5   stock       28392 non-null  object             
dtypes: datetime64[ns, UTC](1), int64(1), object(4)
memory usage: 1.5+ MB

Missing Values:
 Unnamed: 0    0
headline      0
url           0
publisher     0
date          0
stock         0
dtype: int64

Sample Dates Before Parsing:
0   2020-06-05 14:30:54+00:00
1   2020-06-03 14:45:20+00:00
2   2020-05-26 08:30:07+00:00
3   2020-05-22 16:45:06+00:00
4   2020-05-22 15:38:59+00:00
Name: date, dtype: datetime64[ns, UTC]

Sample Dates After Parsing:
0   20

| | headline | url | publisher | date | stock |
|---|---|---|---|---|---|
| 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 14:30:54+00:00 | A |
| 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 14:45:20+00:00 | A |
| 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 08:30:07+00:00 | A |
| 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 16:45:06+00:00 | A |
| 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 15:38:59+00:00 | A |
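
In the cell above, the `NaT` check runs after the date filter. Comparisons against `NaT` evaluate to `False`, so the filter already removes every unparsed row and the later count can only be zero. A minimal sketch of that behavior on a hypothetical three-row frame (`demo`, `cutoff`, and `kept` are illustrative names, not part of the notebook); it counts `NaT` before filtering so the number of parse failures stays visible:

```python
import pandas as pd

# Hypothetical mini-frame: one 2020 timestamp, one unparseable value, one 2019 timestamp.
demo = pd.DataFrame({
    "date": [
        "2020-06-05 10:30:54-04:00",
        "not a date",
        "2019-12-31 09:00:00-05:00",
    ],
})
demo["date"] = pd.to_datetime(demo["date"], errors="coerce", utc=True)

# Count parse failures BEFORE filtering; afterwards they are already gone.
n_unparsed = int(demo["date"].isna().sum())

# An explicit tz-aware cutoff makes the comparison unambiguous.
cutoff = pd.Timestamp("2020-01-01", tz="UTC")
kept = demo[demo["date"] >= cutoff].copy()

print("Unparsed before filter:", n_unparsed)
print("Rows kept:", len(kept))
print("NaT remaining after filter:", int(kept["date"].isna().sum()))
```

Running the count before the filter is one way to see how many rows were lost to parsing rather than to the date cutoff.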


----

### **Summary of Data Cleaning**

- **Initial Inspection:**  
    The dataset was examined for structure, data types, and missing values. After filtering to dates from 2020 onwards, 28,392 rows remained, with 6 columns before cleanup and 5 after.

- **Date Parsing:**  
    The 'date' column was standardized to the `datetime64[ns, UTC]` format, ensuring consistent temporal data.

- **Handling Missing Values:**  
    Rows with missing or unparseable dates (`NaT`) were removed. The missing-value check reported zero nulls in every column of the filtered data.

- **Column Cleanup:**  
    The redundant 'Unnamed: 0' column was dropped, streamlining the dataframe.

- **Index Reset:**  
    The index was reset to maintain sequential order after row removals.

- **Data Export:**  
    The cleaned dataframe was saved to a new CSV file for further analysis.

These steps ensured the dataset is consistent, with reliable date formats and no extraneous columns, making it ready for downstream analysis.
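
The steps summarized above can be sketched as a single reusable function. This is an illustration under assumptions, not the notebook's own code: `clean_ratings`, its `start` parameter, and the `sample` frame are hypothetical names introduced here, and the sketch works on a copy instead of modifying the input in place.

```python
import pandas as pd


def clean_ratings(df: pd.DataFrame, start: str = "2020-01-01") -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps in the summary."""
    out = df.copy()
    # Parse dates as UTC; values that cannot be parsed become NaT.
    out["date"] = pd.to_datetime(out["date"], errors="coerce", utc=True)
    # Drop rows whose date could not be parsed.
    out = out.dropna(subset=["date"])
    # Keep rows on or after the cutoff date.
    out = out[out["date"] >= pd.Timestamp(start, tz="UTC")]
    # Remove the leftover index column if it is present.
    out = out.drop(columns=["Unnamed: 0"], errors="ignore")
    # Restore a sequential index after the row removals.
    return out.reset_index(drop=True)


# Tiny made-up frame for demonstration only.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "headline": ["a", "b", "c"],
    "date": [
        "2020-06-05 10:30:54-04:00",
        "2019-11-01 09:00:00-04:00",
        "bad value",
    ],
    "stock": ["A", "A", "A"],
})
cleaned = clean_ratings(sample)
print(cleaned)
```

Returning a new frame keeps the raw data available for comparison, which makes it easier to report how many rows each step removed.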