#### Data Analysis, Cleansing, and Normalization

This notebook guides you through analyzing, cleansing, and normalizing a dataset using Python.

- Analyze the dataset.
- Cleanse missing or incorrect data.
- Normalize data into a consistent format.

Let's begin!

#### Step 1: Data Analysis

In this step, we will load the dataset and perform some basic analysis to understand its structure and contents.

1. Load the dataset.
2. Display the first few rows.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/drive/datasets/climatebert-netzero-reduction-data.csv')

# Display the first few rows of the dataset
data.head()
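
Beyond `head()`, a few more calls give a quick picture of a dataset's structure. The sketch below runs on a small made-up DataFrame rather than the real CSV; the `target` column and its labels are assumptions for illustration only, and only the `text` column is known to exist in the actual file.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the 'target' column is an assumption.
sample = pd.DataFrame({
    "text": ["We aim for net zero by 2050.", None, "Cut emissions by 30 percent."],
    "target": ["net-zero", "none", "reduction"],
})

n_rows, n_cols = sample.shape                     # number of rows and columns
column_types = sample.dtypes                      # data type of each column
label_counts = sample["target"].value_counts()    # how often each label appears

print(f"Rows: {n_rows}, columns: {n_cols}")
print(column_types)
print(label_counts)
```

The same calls should work unchanged on `data` once the CSV is loaded.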

#### Step 2: Data Cleansing

Data cleansing involves identifying and correcting or removing incorrect, incomplete, or irrelevant data. In this step, we'll:
1. Check for missing values.
2. Remove rows with missing values in the 'text' column.

In [None]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

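# Remove rows with missing values in the 'text' column.
# .copy() makes data_cleaned an independent DataFrame, so later column
# assignments do not trigger pandas' SettingWithCopyWarning.
data_cleaned = data.dropna(subset=['text']).copy()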

data_cleaned.head()
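
`dropna` only removes true missing values (`NaN`/`None`). If a dataset may also contain empty or whitespace-only strings, they can be dropped in the same pass. This is a minimal sketch on made-up data; whether the real dataset contains such rows is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical sample: one valid row, one missing value, two blank strings.
sample = pd.DataFrame({"text": ["Net zero by 2050.", None, "   ", ""]})

# Treat missing values as empty strings, then flag rows that are blank after stripping.
blank = sample["text"].fillna("").str.strip().eq("")

# Keep only the rows with real content; .copy() avoids chained-assignment warnings later.
cleaned = sample[~blank].copy()
rows_dropped = len(sample) - len(cleaned)

print(f"Dropped {rows_dropped} of {len(sample)} rows")
```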

#### Step 3: Data Normalization

Data normalization involves transforming data into a consistent format. In this step, we'll:
1. Convert all text in the 'text' column to lowercase.
2. Remove special characters and numbers, keeping only lowercase letters and whitespace.


In [None]:
import re

# Convert all text to lowercase
data_cleaned['text'] = data_cleaned['text'].str.lower()

# Remove special characters and numbers
data_cleaned['text'] = data_cleaned['text'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

data_cleaned.head() # Display the normalized dataset
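
The same normalization can also be written with pandas' vectorized string methods, which avoids the `apply` call. The sketch below uses made-up sentences and adds one step that is not in the cell above: collapsing the runs of spaces left behind when numbers are removed.

```python
import pandas as pd

# Hypothetical sample sentences, not taken from the real dataset.
sample = pd.DataFrame({"text": ["Cut CO2 by 30 percent!", "Net-zero   by 2050."]})

normalized = (
    sample["text"]
    .str.lower()                                  # lowercase everything
    .str.replace(r"[^a-z\s]", "", regex=True)     # keep only letters and whitespace
    .str.replace(r"\s+", " ", regex=True)         # collapse repeated whitespace (extra step)
    .str.strip()                                  # trim leading and trailing spaces
)

print(normalized.tolist())
# → ['cut co by percent', 'netzero by']
```

Note the side effects visible in the output: removing digits turns "co2" into "co", and removing the hyphen merges "net-zero" into "netzero".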

#### Step 4: Counting Occurrences

Now that we have cleaned and normalized the data, let's count how many rows of the 'text' column mention the terms 'percent' and 'carbon dioxide'.


In [None]:
# Count the rows whose 'text' contains 'percent' or 'carbon dioxide'.
# str.contains flags each row at most once, so these are row counts, not total matches.
count_percent = data_cleaned['text'].str.contains('percent').sum()
count_carbon_dioxide = data_cleaned['text'].str.contains('carbon dioxide').sum()

# Display the results
print(f"Rows containing 'percent': {count_percent}")
print(f"Rows containing 'carbon dioxide': {count_carbon_dioxide}")
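
Note that `str.contains` reports whether a row contains a term at least once, while `str.count` counts every match within a row. The two totals differ whenever a term appears more than once in the same text, as this small sketch on made-up sentences shows.

```python
import pandas as pd

# Hypothetical sample: the first row mentions 'percent' twice.
sample = pd.Series([
    "ten percent now and twenty percent later",
    "carbon dioxide levels rose",
    "no matching terms here",
])

# Number of rows that mention the term at least once.
rows_with_percent = sample.str.contains("percent", regex=False).sum()

# Total number of matches across all rows.
total_percent = sample.str.count("percent").sum()

print(f"Rows containing 'percent': {rows_with_percent}")
print(f"Total matches of 'percent': {total_percent}")
```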

In [None]:

# Replace occurrences of 'percent' with '%' and 'carbon dioxide' with 'CO2'.
# regex=False treats the search terms as literal strings.
data_cleaned['text'] = data_cleaned['text'].str.replace('percent', '%', regex=False)
data_cleaned['text'] = data_cleaned['text'].str.replace('carbon dioxide', 'CO2', regex=False)

# Display the updated dataset
data_cleaned.head()
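
A plain substring replacement also rewrites longer words that contain the term, so "percentage" becomes "%age". If that is not wanted, a word-boundary pattern is one option. The sketch below compares both approaches on a made-up sentence; whether the real dataset contains such words is an assumption.

```python
import pandas as pd

# Hypothetical sentence containing both 'percentage' and 'percent'.
sample = pd.Series(["a large percentage, about ten percent"])

# Literal substring replacement: also changes 'percentage'.
plain = sample.str.replace("percent", "%", regex=False)

# Word-boundary replacement: only the standalone word 'percent' is changed.
bounded = sample.str.replace(r"\bpercent\b", "%", regex=True)

print(plain.iloc[0])    # → a large %age, about ten %
print(bounded.iloc[0])  # → a large percentage, about ten %
```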


In [None]:

# Recount after replacement: rows whose 'text' now contains '%' or 'CO2'.
# regex=False matches the terms literally.
count_percent_replaced = data_cleaned['text'].str.contains('%', regex=False).sum()
count_co2_replaced = data_cleaned['text'].str.contains('CO2', regex=False).sum()

# Display the results
print(f"Rows containing '%': {count_percent_replaced}")
print(f"Rows containing 'CO2': {count_co2_replaced}")


#### Conclusion

In this notebook, we learned how to:
1. Analyze a dataset to understand its structure.
2. Cleanse the data by handling missing values.
3. Normalize the data for consistency.
4. Count specific terms and replace them with standard abbreviations.

These are typical steps in preparing data for further analysis or machine learning.

Great job!
