
# **Data Cleaning and Preprocessing for NLP**
- Cleantech Media Dataset (MAHA)


This notebook serves as a comprehensive guide to managing and analyzing media data related to clean technology. We will cover various data manipulation tasks including examining dataset dimensions, viewing sample data, and cleaning the dataset for further analysis.


## **Setting up the Environment**

Before we begin any data analysis, we need to set up our working environment. This includes importing necessary libraries and ensuring our dataset is loaded correctly.

In [None]:
# !pip install nltk

import ast  # needed later to parse the stringified lists in 'content'
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer




In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/CLT/

## **Import Libraries and Load Data**

After setting up the environment and navigating to the correct directory, let's load the dataset.

In [None]:
# Path to the dataset
cleantech_media_path = '/content/drive/My Drive/CLT/cleantech_media_dataset_v2_2024-02-23.csv'

# Load the dataset
cleantech_media_data = pd.read_csv(cleantech_media_path)

## **Understanding the Dataset Structure**

To get a sense of the size and scope of our dataset, we first look at its shape, which gives the number of rows and columns.

In [None]:
# Display the initial shape of the datasets
print("Initial Cleantech Media Data shape:", cleantech_media_data.shape)

Initial Cleantech Media Data shape: (9593, 7)


This output tells us that the dataset contains 9,593 rows, one per media article, and 7 columns.

## **Data Inspection**

To gain a thorough understanding of our dataset quickly, we can perform several inspections at once: viewing the first few entries, examining the column names, and checking the content of the first article. These steps are crucial for getting familiar with the dataset's structure and contents.

In [None]:
# Display the first few rows of the media dataset
print(cleantech_media_data.head())


   Unnamed: 0                                              title        date  \
0        1280  Qatar to Slash Emissions as LNG Expansion Adva...  2021-01-13   
1        1281               India Launches Its First 700 MW PHWR  2021-01-15   
2        1283              New Chapter for US-China Energy Trade  2021-01-20   
3        1284  Japan: Slow Restarts Cast Doubt on 2030 Energy...  2021-01-22   
4        1285     NYC Pension Funds to Divest Fossil Fuel Shares  2021-01-25   

  author                                            content       domain  \
0    NaN  ["Qatar Petroleum ( QP) is targeting aggressiv...  energyintel   
1    NaN  ["• Nuclear Power Corp. of India Ltd. ( NPCIL)...  energyintel   
2    NaN  ["New US President Joe Biden took office this ...  energyintel   
3    NaN  ["The slow pace of Japanese reactor restarts c...  energyintel   
4    NaN  ["Two of New York City's largest pension funds...  energyintel   

                                                 url  
0  http

- The first few rows give us an immediate sense of the dataset, including the types of columns (e.g., title, date, content) and some entries that contain missing values (NaN) in the 'author' field.
- The 'content' field appears to hold a list written out as a string (each value starts with '["'), suggesting that it will need cleaning or transformation before further analysis.
- Each row contains detailed information about different media articles, providing a rich source for analysis. The 'domain' column indicates the source of the information, and the 'url' provides a direct link to the original publication.
- The presence of NaN values in the 'author' column highlights a common issue in real-world data that will need to be addressed during data cleaning.

## **Understanding Dataset Columns and Data Types**

In this part of the analysis, we display the names of all columns to understand what data we have. We also check the content of the first row to get a sense of what kind of text data is stored in the 'content' column, and we verify the data types to ensure data consistency.

In [None]:
# Display the column names
print(cleantech_media_data.columns, "\n")

# Check the content of the first row
print(cleantech_media_data["content"][0], "\n")

# Check the type of the content column
print(cleantech_media_data["content"].apply(type).value_counts())

Index(['Unnamed: 0', 'title', 'date', 'author', 'content', 'domain', 'url'], dtype='object') 

["Qatar Petroleum ( QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. In its latest Sustainability Report published on Wednesday, QP said its goals include `` reducing the emissions intensity of Qatar's LNG facilities by 25% and of its upstream facilities by at least 15%. '' The company is also aiming to reduce gas flaring intensity across its upstream facilities by more than 75% and has raised its carbon capture and storage ambitions from 5 million tons/yr to 7 million tons/yr by 2027. About 2.2 million tons/yr of the carbon capture goal will come from the 32 million ton/yr Phase 1 of the LNG expansion, also known as the North Field East project. A further 1.1 million tons/yr will come from Phase 2, known as the North Field South project, which will raise Qatar's LNG capacity by a further 16

- Column Names: The output lists all the columns in the dataset, including 'title', 'date', 'author', 'content', 'domain', and 'url'. This helps identify what kinds of information each column holds, which is essential for deciding how to handle each one during the analysis.
- First Row Content: The content of the first row provides insight into the detailed textual data stored in the 'content' column. It includes comprehensive information about corporate strategies and initiatives, which are critical for analyses related to business intelligence, policy making, or market trends.
- Data Type Verification: The data type of the 'content' column is confirmed to be <class 'str'>, indicating that all entries are stored as strings. This uniformity is crucial for text processing tasks, ensuring that methods like string manipulation or natural language processing can be applied directly without additional type conversion.
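
To see what the type check would report for a column that is not uniform, it can be tried on a small made-up Series. This is only a sketch: `sample` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column: two strings, one float NaN, one real list
sample = pd.Series(['["a", "b"]', "plain text", float("nan"), ["x", "y"]])

# Count how many entries of each Python type the column holds
type_counts = sample.apply(type).value_counts()
print(type_counts)

# String methods are only safe if every entry is a str
all_strings = sample.apply(lambda v: isinstance(v, str)).all()
print("All entries are strings:", all_strings)
```

A single `<class 'str'>` row in the counts, as reported for our 'content' column, means no mixed types need handling first.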

## **Data Cleaning Preparations: Identifying Missing Values and Duplicates**

As part of our data cleaning process, it is crucial to first identify any missing values and duplicate entries. These steps are foundational for ensuring data quality and reliability for further analysis.

In [None]:
# Print missing values in each column
print("Missing values before cleaning:")
print(cleantech_media_data.isnull().sum())

# Check for duplicates in the media data
print("Duplicates in media data:", cleantech_media_data.duplicated().sum())


Missing values before cleaning:
Unnamed: 0       0
title            0
date             0
author        9562
content          0
domain           0
url              0
dtype: int64
Duplicates in media data: 0


- Missing Values:
    - Unnamed: 0, title, date, content, domain, url: These columns have zero missing values, so the fields most important for further analysis are fully populated.
    - author: There are 9,562 missing values in the 'author' column, about 99.7% of the 9,593 rows, so almost no entries in this dataset have an associated author. Depending on the goals of your analysis, you might ignore this column if authorship is not relevant, or you may need to address the missing values, possibly by imputing data or by acknowledging this gap in any data-driven conclusions.

- Duplicates: The check for duplicates returned a count of zero, so there are no fully identical rows in the dataset. Note that this check compares all columns, including 'Unnamed: 0', so rows that differ only in that column would not be counted as duplicates.
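
Absolute counts of missing values are easier to judge as a share of the column. The sketch below shows one way to compute that share on a made-up frame. The `toy` data and the 0.5 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame mimicking a mostly empty 'author' column
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "author": [None, None, "Jane Doe", None],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Columns above an arbitrary threshold become candidates for dropping
to_drop = missing_share[missing_share > 0.5].index.tolist()
print("Candidates to drop:", to_drop)
```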

## **Cleaning Text Data in the 'Content' Column**
Sometimes, data can be stored in formats that are not immediately usable for analysis, such as text stored as string representations of lists. Here, we'll convert these strings back into actual text, making them easier to work with for text analysis.

In [None]:
# Convert the string representation of list in 'content' column to actual list and then join as string
cleantech_media_data['content'] = cleantech_media_data['content'].apply(lambda x: ' '.join(ast.literal_eval(x)))

# Show the cleaned 'content' column for the first few entries
cleantech_media_data['content'].head()

0    Qatar Petroleum ( QP) is targeting aggressive ...
1    • Nuclear Power Corp. of India Ltd. ( NPCIL) s...
2    New US President Joe Biden took office this we...
3    The slow pace of Japanese reactor restarts con...
4    Two of New York City's largest pension funds s...
Name: content, dtype: object

Effective Conversion: The use of ast.literal_eval() successfully converted string representations of lists into actual list objects, which were then joined into coherent, continuous strings. This process ensured that each entry in the 'content' column is now in a straightforward text format, free from the complexities of list notations.
With the content now presented as clean, uninterrupted text, the dataset is more suitable for text analysis.
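
The conversion above assumes every entry is a valid list literal. If that is not guaranteed, a more defensive variant can fall back to the raw value instead of raising. The helper name `list_string_to_text` is hypothetical and the sketch is not part of the original notebook.

```python
import ast

def list_string_to_text(raw):
    """Join a stringified list of text parts into one string.

    Returns the input unchanged if it is not a valid Python list literal.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal at all, e.g. ordinary prose
        return raw
    if isinstance(parsed, list):
        return " ".join(str(part) for part in parsed)
    return raw

joined = list_string_to_text('["First paragraph.", "Second paragraph."]')
untouched = list_string_to_text("Not a list, just text")
print(joined)
print(untouched)
```

It could be applied in the same way as the lambda, for example with `cleantech_media_data['content'].apply(list_string_to_text)`.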

## **Advanced Data Cleaning**
As we progress with cleaning the dataset, it is essential to remove any unnecessary or redundant information, ensuring the data is as relevant and concise as possible for analysis.

In [None]:

# Drop rows with missing values in 'content' column
cleantech_media_data = cleantech_media_data.dropna(subset=["content"])

# Remove duplicates
cleantech_media_data = cleantech_media_data.drop_duplicates()

# Drop unnecessary 'Unnamed: 0' and 'author' columns
cleantech_media_data = cleantech_media_data.drop(columns=['Unnamed: 0', 'author'])


- Removing Missing Data: By removing rows with missing 'content', we ensure that our analysis is based only on complete data. The earlier check found no missing 'content' values, so in this dataset the step acts as a safeguard and removes nothing.
- Eliminating Duplicates: Removing duplicates prevents repeated articles from skewing the outcome of our analysis. The earlier check found no duplicate rows, so this step also leaves the row count unchanged here.
- Simplifying the Dataset: Dropping unnecessary columns focuses the dataset on relevant data only. The 'Unnamed: 0' column is generally an artifact of saving and reloading a DataFrame and does not contain useful information. The 'author' column is missing for almost every row, so dropping it reduces complexity and potential noise within the dataset.

These steps collectively enhance the quality of the dataset and prepare it for more effective data analysis, ensuring that the data is not only clean but also focused and relevant to the tasks at hand.
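
The same three steps can also be written as one chained expression that returns a new DataFrame instead of modifying it in place. The sketch below runs on made-up data. The `toy` frame and the added `reset_index` call are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame: one row lacks content, two rows are identical
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 2],
    "title": ["a", "b", "c", "c"],
    "author": [None, None, None, None],
    "content": ["text a", None, "text c", "text c"],
})

cleaned = (
    toy.dropna(subset=["content"])           # drop rows without content
       .drop_duplicates()                    # drop fully identical rows
       .drop(columns=["Unnamed: 0", "author"])
       .reset_index(drop=True)               # renumber rows after removals
)
print(cleaned)
```

Because duplicates are checked before 'Unnamed: 0' is dropped, two rows with the same article but different index values would both survive. Whether to swap the order depends on how duplicates should be defined for the analysis.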

## **Converting Date Columns to Datetime Format**

Handling dates correctly in a dataset is crucial for many types of analysis, especially when time trends are involved. Converting date columns to a proper datetime format allows for more accurate and efficient operations on these data.




In [None]:
cleantech_media_data['date'] = pd.to_datetime(cleantech_media_data['date'])
print(cleantech_media_data.dtypes)


title              object
date       datetime64[ns]
content            object
domain             object
url                object
dtype: object


- Datetime Conversion: The 'date' column has been successfully converted to datetime64[ns], the standard datetime format in pandas. This format will help in performing any date-time specific operations like sorting, filtering by date, time-series analysis, etc.

- Data Types Verification: The data types output confirms that all other columns retain their original data type while the 'date' column is now appropriately formatted for any temporal analysis.
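
If some date strings might be malformed, the conversion can be made tolerant and the parsed values can then be used through the `.dt` accessor. This is a minimal sketch on made-up values, and `raw_dates` is hypothetical.

```python
import pandas as pd

# Hypothetical toy dates, including one malformed value
raw_dates = pd.Series(["2021-01-13", "2021-01-15", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)

# Once parsed, the .dt accessor exposes date parts for grouping or filtering
print(parsed.dt.year.value_counts())
print("Unparseable entries:", parsed.isna().sum())
```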

## **Final Verification of Data Structure and Types**

After all our data cleaning and transformation steps, it is crucial to perform a final check on the data structure and types to ensure everything is in order and ready for analysis. This final verification helps confirm that the dataset is correctly formatted and that all changes have been properly implemented.



In [None]:
# Final verification of data structure and types
print(cleantech_media_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9593 entries, 0 to 9592
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   title    9593 non-null   object        
 1   date     9593 non-null   datetime64[ns]
 2   content  9593 non-null   object        
 3   domain   9593 non-null   object        
 4   url      9593 non-null   object        
dtypes: datetime64[ns](1), object(4)
memory usage: 374.9+ KB
None


- The output confirms that the DataFrame has a total of 9,593 entries, matching the original dataset count, which means no data was lost unintentionally through the cleaning and transformation processes.
- Each column shows a 'Non-Null Count' equal to the total number of entries, indicating there are no missing values in these columns after our cleaning steps.
- The data types are correctly listed, with the 'date' column successfully converted to datetime64[ns] and other columns as object. This ensures that the data is formatted correctly for further analysis, particularly with the 'date' column now optimized for time-series and chronological analyses.
- The memory usage is listed, which provides insight into the dataset's size and can help in assessing computational resource needs for further data processing and analysis.
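
Reading the `info()` output by eye works for a notebook, but the same expectations can also be written as assertions that fail loudly if a later change breaks them. The sketch below uses a hypothetical stand-in frame named `cleaned` with the same five columns and made-up example.com URLs.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame, with the same five columns
cleaned = pd.DataFrame({
    "title": ["t1", "t2"],
    "date": pd.to_datetime(["2021-01-13", "2021-01-15"]),
    "content": ["text one", "text two"],
    "domain": ["energyintel", "energyintel"],
    "url": ["https://example.com/1", "https://example.com/2"],
})

expected_columns = ["title", "date", "content", "domain", "url"]

# Each check raises AssertionError with a message if the expectation fails
assert list(cleaned.columns) == expected_columns, "unexpected columns"
assert cleaned.notna().all().all(), "missing values remain"
assert pd.api.types.is_datetime64_any_dtype(cleaned["date"]), "date is not datetime"
assert not cleaned.duplicated().any(), "duplicate rows remain"
print("All structural checks passed")
```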

# **Text Preprocessing**

## **Text Preprocessing Setup**

Text preprocessing is a crucial step in any natural language processing (NLP) workflow. It involves setting up the necessary tools and transforming raw text into a clean and usable format. Here, we ensure that NLTK's tokenizers, stopwords, and lemmatizers are ready for text processing tasks.



In [None]:
# Ensure that NLTK's tokenizer, stopword, and WordNet data are available
# Note: newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## **Defining a Text Preprocessing Function**
Properly preparing text data is crucial for most NLP tasks. This function is designed to standardize and simplify text, making it more amenable to analysis. The preprocessing steps include converting to lowercase, removing punctuation, tokenizing, removing stopwords, and lemmatizing.

In [None]:
# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess text by lowercasing, removing non-letter characters,
    tokenizing, removing stopwords, and lemmatizing.
    """
    # Convert to lowercase
    text = text.lower()
    # Remove everything except a-z and whitespace (punctuation, digits, symbols)
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Lemmatization (tokens are treated as nouns by default)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

- Lowercasing is the first step in text preprocessing, used to treat words such as "Hello", "hello", and "HELLO" as the same word.
- Removing punctuation cleans the text by deleting every character that is not a lowercase letter or whitespace. This covers punctuation and special characters, and also digits, so numbers such as "700" or "2030" are dropped.
- Tokenization breaks the text into individual elements (words) so that we can apply further processing such as stopword removal and lemmatization.
- Stopwords are common words that typically don't contribute significantly to the meaning of a sentence (e.g., "the", "is", and "at"). Removing them helps focus on the important information.
- Lemmatization reduces words to their base or dictionary form, consolidating different forms of a word into a single item. As called here without a part-of-speech tag, WordNetLemmatizer treats every token as a noun, so plurals are reduced (e.g., "emissions" becomes "emission") while verb forms such as "running" or "ran" stay unchanged unless pos='v' is passed.
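
The effect of the character-removal pattern on hyphenated words and numbers is easy to check in isolation. The sketch below uses only the standard library. The example sentence is made up, and the space-replacing variant is an alternative shown for comparison, not what the notebook's function does.

```python
import re

# Hypothetical example sentence
title = "New Chapter for US-China Energy Trade: 700 MW by 2030"
lowered = title.lower()

# Pattern from preprocess_text: delete everything except a-z and whitespace
deleted = re.sub(r"[^a-z\s]", "", lowered)

# Alternative (an assumption, not the notebook's choice): replace with a space,
# then collapse repeated whitespace so hyphenated words stay separate
spaced = re.sub(r"\s+", " ", re.sub(r"[^a-z\s]", " ", lowered)).strip()

print(repr(deleted))
print(repr(spaced))
```

Deleting the hyphen fuses "us-china" into the single token "uschina", while replacing it with a space yields "us" and "china". Which behavior is preferable depends on the downstream task.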

## **Applying Text Preprocessing to Dataset Columns**
After defining the text preprocessing function, we apply it to the relevant columns in our dataset. This ensures that the text data within these columns is clean and standardized, making it suitable for further analysis, such as feature extraction, sentiment analysis, or text modeling.

In [None]:
# Preprocess 'title' and 'content' columns in the media dataset
cleantech_media_data['title_preprocessed'] = cleantech_media_data['title'].apply(preprocess_text)
cleantech_media_data['content_preprocessed'] = cleantech_media_data['content'].apply(preprocess_text)

# Displaying a sample of the preprocessed data
print("Media Data Preprocessed Sample:")
print(cleantech_media_data[['title', 'title_preprocessed', 'content', 'content_preprocessed']].head())

Media Data Preprocessed Sample:
                                               title  \
0  Qatar to Slash Emissions as LNG Expansion Adva...   
1               India Launches Its First 700 MW PHWR   
2              New Chapter for US-China Energy Trade   
3  Japan: Slow Restarts Cast Doubt on 2030 Energy...   
4     NYC Pension Funds to Divest Fossil Fuel Shares   

                           title_preprocessed  \
0  qatar slash emission lng expansion advance   
1                  india launch first mw phwr   
2            new chapter uschina energy trade   
3  japan slow restarts cast doubt energy plan   
4   nyc pension fund divest fossil fuel share   

                                             content  \
0  Qatar Petroleum ( QP) is targeting aggressive ...   
1  • Nuclear Power Corp. of India Ltd. ( NPCIL) s...   
2  New US President Joe Biden took office this we...   
3  The slow pace of Japanese reactor restarts con...   
4  Two of New York City's largest pension funds s...   


The output shows both the original and preprocessed versions of the 'title' and 'content' columns. This allows for a clear comparison to see how the text has been simplified and standardized.

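- Titles: The preprocessing lowercases the text, removes all non-alphabetic characters (including digits), removes stopwords, and lemmatizes the remaining tokens. For example, "India Launches Its First 700 MW PHWR" becomes "india launch first mw phwr", and "US-China" collapses to "uschina" because the hyphen is deleted rather than replaced with a space.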
- Content: Similar transformations are applied to the content, which now ignores irrelevant punctuation and common words, and normalizes the text for further analysis.

## **Final Verification and Overview**
Before concluding our data preprocessing tasks, it's good practice to perform a final check to ensure all transformations have been applied correctly to the dataset.


In [None]:
print(cleantech_media_data.head())


                                               title       date  \
0  Qatar to Slash Emissions as LNG Expansion Adva... 2021-01-13   
1               India Launches Its First 700 MW PHWR 2021-01-15   
2              New Chapter for US-China Energy Trade 2021-01-20   
3  Japan: Slow Restarts Cast Doubt on 2030 Energy... 2021-01-22   
4     NYC Pension Funds to Divest Fossil Fuel Shares 2021-01-25   

                                             content       domain  \
0  Qatar Petroleum ( QP) is targeting aggressive ...  energyintel   
1  • Nuclear Power Corp. of India Ltd. ( NPCIL) s...  energyintel   
2  New US President Joe Biden took office this we...  energyintel   
3  The slow pace of Japanese reactor restarts con...  energyintel   
4  Two of New York City's largest pension funds s...  energyintel   

                                                 url  \
0  https://www.energyintel.com/0000017b-a7dc-de4c...   
1  https://www.energyintel.com/0000017b-a7dc-de4c...   
2  https://www

The first few rows show the preprocessed 'title' and 'content' along with the original columns, confirming that our data transformations have been applied throughout the dataset.

## **Saving the Preprocessed Data**
After verifying that the data is clean and correctly formatted, it is crucial to save the processed data to a file for future use or further analysis.



In [None]:
# Path for the cleaned dataset in the Google Drive project folder
path_to_save = '/content/drive/My Drive/CLT/cleantech_media_dataset_cleaned.csv'

# Save the DataFrame to CSV
cleantech_media_data.to_csv(path_to_save, index=False)


By saving the DataFrame to a CSV file, we ensure that all the preprocessing steps are preserved. This file can now be used for further analysis, machine learning models, or reporting without needing to reapply preprocessing steps.
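
One detail to keep in mind when reloading: CSV stores dates as text, so the datetime dtype is not restored automatically. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer in place of the Google Drive path. The names `frame`, `plain`, and `restored` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row frame with a datetime column
frame = pd.DataFrame({
    "title": ["t1", "t2"],
    "date": pd.to_datetime(["2021-01-13", "2021-01-15"]),
})

# Write to an in-memory buffer instead of Google Drive for this sketch
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Without parse_dates the column comes back as plain strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored on load
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print(plain["date"].dtype, restored["date"].dtype)
```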