# **Data Cleaning & Initial Checks**

---

## **Setup & Load the Data**
---

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import pytz
from IPython.display import display
import matplotlib.pyplot as plt
import sys
import os
import seaborn as sns

In [2]:
# Load a CSV file from your local file system
df = pd.read_csv('C:/Users/Admin/OneDrive/10 Academy/Week 1/Technical Content/Data/raw_analyst_ratings.csv')
df.head()

|   | Unnamed: 0 | headline | url | publisher | date | stock |
|---|------------|----------|-----|-----------|------|-------|
| 0 | 0 | Stocks That Hit 52-Week Highs On Friday | https://www.benzinga.com/news/20/06/16190091/s... | Benzinga Insights | 2020-06-05 10:30:54-04:00 | A |
| 1 | 1 | Stocks That Hit 52-Week Highs On Wednesday | https://www.benzinga.com/news/20/06/16170189/s... | Benzinga Insights | 2020-06-03 10:45:20-04:00 | A |
| 2 | 2 | 71 Biggest Movers From Friday | https://www.benzinga.com/news/20/05/16103463/7... | Lisa Levin | 2020-05-26 04:30:07-04:00 | A |
| 3 | 3 | 46 Stocks Moving In Friday's Mid-Day Session | https://www.benzinga.com/news/20/05/16095921/4... | Lisa Levin | 2020-05-22 12:45:06-04:00 | A |
| 4 | 4 | B of A Securities Maintains Neutral on Agilent... | https://www.benzinga.com/news/20/05/16095304/b... | Vick Meyer | 2020-05-22 11:38:59-04:00 | A |


---
## **Data Cleaning**

In [3]:
# Check the structure, data types, and missing values
print("DataFrame Info:")
df.info()
print("\nMissing Values:\n", df.isnull().sum())

# Preview the first raw 'date' strings before parsing
print("\nSample Dates Before Parsing:")
print(df['date'].head())

# Convert 'date' column to UTC datetimes; strings that fail to parse become NaT (errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

print("\nSample Dates After Parsing:")
print(df['date'].head())

# Check for any parsing failures (NaT)
print("\nUnparsed Dates (NaT):", df['date'].isna().sum())

# Save the cleaned dataframe to a local CSV file
df.to_csv('C:/Users/Admin/OneDrive/10 Academy/Week 1/Technical Content/Data/raw_analyst_ratings_cleaned.csv', index=False)

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1407328 entries, 0 to 1407327
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1407328 non-null  int64 
 1   headline    1407328 non-null  object
 2   url         1407328 non-null  object
 3   publisher   1407328 non-null  object
 4   date        1407328 non-null  object
 5   stock       1407328 non-null  object
dtypes: int64(1), object(5)
memory usage: 64.4+ MB

Missing Values:
 Unnamed: 0    0
headline      0
url           0
publisher     0
date          0
stock         0
dtype: int64

Sample Dates Before Parsing:
0    2020-06-05 10:30:54-04:00
1    2020-06-03 10:45:20-04:00
2    2020-05-26 04:30:07-04:00
3    2020-05-22 12:45:06-04:00
4    2020-05-22 11:38:59-04:00
Name: date, dtype: object

Sample Dates After Parsing:
0   2020-06-05 14:30:54+00:00
1   2020-06-03 14:45:20+00:00
2   2020-05-26 08:30:07+00:00
3   2020-05-22 16:45:06+00:00

## **Summary Report: Analyst Ratings DataFrame**

This dataset (`df`) contains **1,407,328** rows and **6 columns** related to analyst ratings and stock market news headlines. Below is a summary of its structure and contents:

### **Columns Overview**
- **Unnamed: 0**: Index column from the original CSV (int64).
- **headline**: News headline text (object).
- **url**: Link to the news article (object).
- **publisher**: Name of the news publisher or author (object).
- **date**: Date and time of publication, stored as text in the raw file and converted to `datetime64[ns, UTC]`; most values are `NaT` after conversion.
- **stock**: Stock ticker symbol (object).
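
Because `Unnamed: 0` appears to repeat the row index, it can be dropped after loading or used as the index at load time. The following is a minimal sketch on a small in-memory CSV; `sample_csv`, `df_sample`, `df_dropped`, and `df_indexed` are hypothetical names, and the two-row text only stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: the first column has an empty header,
# which pandas names "Unnamed: 0" on load.
sample_csv = io.StringIO(
    ",headline,stock\n"
    "0,Stocks That Hit 52-Week Highs On Friday,A\n"
    "1,71 Biggest Movers From Friday,A\n"
)

# Option 1: load as-is, then drop the redundant column.
df_sample = pd.read_csv(sample_csv)
df_dropped = df_sample.drop(columns=["Unnamed: 0"])

# Option 2: use the first column as the index while loading.
sample_csv.seek(0)
df_indexed = pd.read_csv(sample_csv, index_col=0)

print(df_dropped.columns.tolist())
print(df_indexed.columns.tolist())
```

Either option leaves only the content columns; which one fits depends on whether the original row numbers are needed later.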

### **Data Quality**
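- After parsing, the `date` column has only **55,987** non-null entries out of 1,407,328 (~4% completeness). The raw column had no missing values, so the remaining ~96% are strings that `pd.to_datetime(..., errors='coerce')` could not parse and silently converted to `NaT`, most likely because they use a different timestamp format from the first rows.
- All other columns are fully populated.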
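
To see why so many `date` values are missing after parsing, one option is to summarise the raw strings by their "shape" before converting them. The sketch below replaces every digit with `9` and counts the resulting patterns. The helper `date_shapes` and the sample series `raw_dates` are hypothetical; the second and third sample strings are assumed examples of other formats, not values taken from the dataset.

```python
import pandas as pd

def date_shapes(raw: pd.Series) -> pd.Series:
    """Replace every digit with '9' so strings sharing a format collapse together."""
    return raw.astype(str).str.replace(r"\d", "9", regex=True)

# Hypothetical sample; on the real data this would be the unparsed 'date' column.
raw_dates = pd.Series([
    "2020-06-05 10:30:54-04:00",  # format seen in the first rows
    "2020-05-22 00:00:00",        # assumed: timestamp without a UTC offset
    "not a date",                 # assumed: malformed value
])

shapes = date_shapes(raw_dates)

# Each distinct shape is a candidate format that the parser has to handle.
print(shapes.value_counts())
```

This has to run on the raw strings, that is, before `df['date']` is overwritten by `pd.to_datetime`.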

### **Memory Usage**
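- The DataFrame uses at least **64.4 MB** of memory. `df.info()` reports `64.4+ MB`; the `+` means the string contents of the `object` columns are not included in the estimate.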
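
A fuller memory figure can be obtained with `memory_usage(deep=True)`, which also counts the strings held in `object` columns. The sketch below uses a hypothetical miniature frame, `df_small`, rather than the real data, and additionally shows that a low-cardinality column such as `stock` may shrink when stored as `category`; whether that pays off on the full dataset is an assumption to verify.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as `df`,
# forced to object dtype to mirror the dtypes reported by df.info().
df_small = pd.DataFrame(
    {
        "headline": ["Stocks That Hit 52-Week Highs On Friday"] * 1000,
        "publisher": ["Benzinga Insights"] * 1000,
        "stock": ["A"] * 1000,
    },
    dtype=object,
)

# Shallow estimate: counts only the references stored in object columns.
shallow_bytes = df_small.memory_usage().sum()

# Deep estimate: also counts the Python string objects themselves.
deep_bytes = df_small.memory_usage(deep=True).sum()

# Storing a repetitive column as 'category' keeps one copy of each label.
stock_object_bytes = df_small["stock"].memory_usage(deep=True)
stock_category_bytes = df_small["stock"].astype("category").memory_usage(deep=True)

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(f"stock as object: {stock_object_bytes} bytes, as category: {stock_category_bytes} bytes")
```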

### **Sample Data**
| headline                                         | publisher         | date                      | stock |
|--------------------------------------------------|-------------------|---------------------------|-------|
| Stocks That Hit 52-Week Highs On Friday          | Benzinga Insights | 2020-06-05 14:30:54+00:00 | A     |
| 71 Biggest Movers From Friday                    | Lisa Levin        | 2020-05-26 08:30:07+00:00 | A     |
| B of A Securities Maintains Neutral on Agilent...| Vick Meyer        | 2020-05-22 15:38:59+00:00 | A     |

### **Key Observations**
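- The dataset is large (about 1.4 million rows) and covers a wide range of news headlines and publishers.
- The `date` field needs attention before any time-based analysis: about 96% of its values are `NaT` after parsing, even though the raw column has no missing values.
- The remaining columns are ready for further cleaning, analysis, or feature engineering.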
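
If the missing dates turn out to come from a second timestamp format, a two-pass parse is one way to recover them. The sketch below is not the notebook's method: it assumes the unparsed rows look like `YYYY-MM-DD HH:MM:SS` with no offset, and it treats those timestamps as UTC, which may not match the publisher's local time. `raw_dates`, `with_offset`, `without_offset`, and `parsed` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample; the offset-less format is an assumption about the real data.
raw_dates = pd.Series([
    "2020-06-05 10:30:54-04:00",  # timestamp with a UTC offset
    "2020-05-22 00:00:00",        # assumed: timestamp without an offset
    "not a date",                 # assumed: malformed value
])

# Pass 1: strings that carry an explicit UTC offset.
with_offset = pd.to_datetime(
    raw_dates, format="%Y-%m-%d %H:%M:%S%z", errors="coerce", utc=True
)

# Pass 2: strings without an offset, interpreted as UTC (an assumption).
without_offset = pd.to_datetime(
    raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce", utc=True
)

# Prefer the offset-aware result and fall back to the second pass where it is NaT.
parsed = with_offset.fillna(without_offset)

print(parsed)
print("Still unparsed:", parsed.isna().sum())
```

On pandas 2.0 or later, passing `format="ISO8601"` or `format="mixed"` to `pd.to_datetime` is an alternative to the two explicit passes. Values that remain `NaT` after this step are worth inspecting by hand.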