# Data Cleaning

## Introduction

This notebook starts with data cleaning, a necessary step in any data science project. If dirty data goes into a model, no one will get meaningful results from it. The data we give to our model should be cleaned beforehand to get the best results.

Specifically, we'll be walking through:

1. **Getting the data** - the texts are collected manually from different sources
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Problem Statement

The turn of the millennium was a period in which technology and democracy were expected to reach a certain level all over the world.

However, it is hard to claim that all communities enjoy the same level of prosperity or the same benefits of democracy.

In this work, I will analyze the words used in the speeches of the leaders of various countries.

This is a data science exercise; I will leave it to political scientists to discuss the results of this study.

## Getting The Data

Unfortunately, no single source provides transcripts of all the leaders' speeches. I will analyze English texts that I copied manually from different sources.

In [1]:
import glob   # to iterate over the .txt files
import re     # to extract each leader's name from its file name

In [2]:
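# Maps each leader's name to the full text of their speech.
# Note: the name `dict` shadows Python's built-in type of the same name.
dict = {}

# Capture the file name (without its extension) as the leader's name.
pattern = r'raw_data/(\w+)\.*'

for filename in glob.glob('raw_data/*.txt'):
    leaders_name = re.search(pattern, filename)
    with open(filename, 'r') as file:
        text = file.read()

    dict[leaders_name[1]] = text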

In [3]:
next(iter(dict.keys()))  # for example, the first key

'Trudeau'

In [4]:
next(iter(dict.values()))

"Before I begin, it won’t surprise you that my remarks today will address the importance of progressive values in the context of globalization.\n\nOn that note, today, I am pleased to announce that Canada and the ten other remaining members of the Trans-Pacific Partnership concluded discussions in Tokyo, Japan, on a new Comprehensive and Progressive Agreement for Trans-Pacific Partnership (CPTPP).\n\nThe agreement reached in Tokyo today is the right deal. Our government stood up for Canadian interests and this agreement meets our objectives of creating and sustaining growth, prosperity and well-paying middle-class jobs today and for generations to come.\n\nWe are pleased with the progress we have made to make this deal more progressive and stronger for Canadian workers on intellectual property, culture and the automotive sector\u200e.\n\nTrade helps strengthen the middle class, but for it to work we must ensure that the benefits are shared with all our citizens, not just the few. The C

### At this point we have a dictionary that holds the leaders' names and speeches as key-value pairs.
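
A minimal sketch of the same loading step, assuming the speeches sit in a `raw_data/` folder as one `.txt` file per leader. The helper name `load_speeches` and the UTF-8 default are assumptions, not part of the notebook; `Path.stem` stands in for the regular expression that derives the leader's name from the file name.

```python
from pathlib import Path


def load_speeches(folder, encoding="utf-8"):
    """Return {leader_name: speech_text} for every .txt file in folder.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    speeches = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Path.stem is the file name without its extension,
        # e.g. 'raw_data/Xi_Jinping.txt' -> 'Xi_Jinping'.
        speeches[path.stem] = path.read_text(encoding=encoding)
    return speeches
```

Calling `load_speeches("raw_data")` would return a dictionary of the same shape as the one built above. Unlike the `\w+` pattern, `Path.stem` keeps characters such as hyphens in a file name, so the two approaches can differ on names that contain them.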

***

### Now I want to turn the dictionary into a pandas DataFrame, so that I can work on it with pandas and save it as a CSV file.

In [5]:
# We are going to change this to key: leader, value: string format
def combine_text(text):
    '''Returns the text as one string; a list of strings is joined with spaces.'''
    if isinstance(text, str):
        # Each speech is already a single string, so return it unchanged
        return text
    return ' '.join(text)

In [6]:
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in dict.items()}

In [7]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data = pd.DataFrame.from_dict(data_combined).transpose()
data.columns = ['speeches']
data = data.sort_index()
data

| | speeches |
|---|---|
| Erdogan | I wholeheartedly greet our 81 provinces as well as sister and friendly capitals and cities of the world from Ankara, from the AK Party headquarter... |
| Macron | Dear friends,\n\nWe are here for this General Assembly, with the Secretary-General having chosen what is such an important topic, the climate, and... |
| Merkel | Ladies and gentlemen,\n\nNearly 50 years ago, Walter Hallstein, the former German Commission President, referred to European integration as an “en... |
| Nicolas_Maduro | Ambassadors, heads of delegations of the countries members of the United Nations Organization, President-elect of the General Assembly, Mrs. Maria... |
| Obama | \nMy fellow citizens:\n\nI stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices bor... |
| Putin | Citizens of Russia, members of the Federation Council and State Duma,\n\nToday's Address is a very special landmark event, just as the times we ar... |
| Trudeau | Before I begin, it won’t surprise you that my remarks today will address the importance of progressive values in the context of globalization.\n\n... |
| Trump | "Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and my fellow Americans:\n\nWe meet tonight at a mom... |
| Xi_Jinping | Ladies and Gentlemen,\n\nFriends,\n\nSeptember has just set in Beijing, bringing with it refreshing breeze and picturesque autumn scenery. And we ... |
| Yoshihide_Suga | I am truly delighted to meet you all, students of the Vietnam-Japan University (VJU), which is a symbol of human resource development projects bet... |


In [8]:
# Let's take a look at an example speech
data.speeches.loc['Merkel']

'Ladies and gentlemen,\n\nNearly 50 years ago, Walter Hallstein, the former German Commission President, referred to European integration as an “endeavour of unparalleled boldness”. The year was 1969, ten years before the first direct European elections took place. At the time, the European Community was still in its infancy. Many of our greatest accomplishments – Schengen, the single market and a single currency – were still mere visions. But the foundation had been laid. And with that, after centuries marked by wars and dictatorships, the citizens of Europe for the first time held out hope for a sustainable, peaceful and stable future.\n\nToday, I am pleased and grateful to be standing in front of the largest democratic parliament in the world. Together, you – 751 Members, elected in 28 member states – represent more than 500 million people. That is nearly seven percent of the world’s population. In your House, we can feel the heart of European democracy beating. The debates, which a

***

### In this step I want to clean the data using regular expressions, in order to get the best possible results.

In [11]:
# Apply text cleaning techniques with re
import re
import string

def clean_data(speech):
    # Make text lowercase
    speech = speech.lower()

    # Remove text in square brackets
    speech = re.sub(r'\[.*?\]', '', speech)

    # Remove ASCII punctuation
    speech = re.sub('[%s]' % re.escape(string.punctuation), '', speech)

    # Remove words containing numbers
    speech = re.sub(r'\w*\d\w*', '', speech)

    # Remove curly quotes and ellipses, which string.punctuation does not cover
    speech = re.sub('[‘’“”…]', '', speech)

    # Finally, replace newlines with spaces
    speech = re.sub(r'\n', ' ', speech)

    return speech

# Alias used when applying the cleaning function to the DataFrame column
clearance = clean_data
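
To see what each substitution does, the sketch below applies the same patterns one at a time to a short made-up sentence and prints the intermediate text after every step. The sample sentence and the step labels are assumptions for illustration; they are not taken from the speeches.

```python
import re
import string

# Made-up sample text (an assumption, not from the corpus).
sample = "[Applause] In 1969, the G7 said: “Europe’s future”…\nThank you!"

# (label, pattern, replacement) for each cleaning step, in order.
steps = [
    ("remove bracketed text", r"\[.*?\]", ""),
    ("remove ASCII punctuation", "[%s]" % re.escape(string.punctuation), ""),
    ("remove words containing digits", r"\w*\d\w*", ""),
    ("remove curly quotes and ellipsis", "[‘’“”…]", ""),
    ("replace newlines with spaces", r"\n", " "),
]

text = sample.lower()
for label, pattern, replacement in steps:
    text = re.sub(pattern, replacement, text)
    # repr() makes leftover spaces and newlines visible.
    print(f"{label:35} -> {text!r}")
```

Removing whole tokens leaves runs of spaces behind. That is harmless for the word counts built later, because the tokenizer only looks for word characters.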

In [12]:
# Let's take a look at the new cleaned version of text
cleaned = pd.DataFrame(data.speeches.apply(clearance))
cleaned

| | speeches |
|---|---|
| Erdogan | i wholeheartedly greet our provinces as well as sister and friendly capitals and cities of the world from ankara from the ak party headquarters i... |
| Macron | dear friends we are here for this general assembly with the secretarygeneral having chosen what is such an important topic the climate and we had... |
| Merkel | ladies and gentlemen nearly years ago walter hallstein the former german commission president referred to european integration as an endeavour o... |
| Nicolas_Maduro | ambassadors heads of delegations of the countries members of the united nations organization presidentelect of the general assembly mrs maria fern... |
| Obama | my fellow citizens i stand here today humbled by the task before us grateful for the trust you have bestowed mindful of the sacrifices borne by ... |
| Putin | citizens of russia members of the federation council and state duma todays address is a very special landmark event just as the times we are livi... |
| Trudeau | before i begin it wont surprise you that my remarks today will address the importance of progressive values in the context of globalization on th... |
| Trump | madam speaker mr vice president members of congress the first lady of the united states and my fellow americans we meet tonight at a moment of un... |
| Xi_Jinping | ladies and gentlemen friends september has just set in beijing bringing with it refreshing breeze and picturesque autumn scenery and we are so d... |
| Yoshihide_Suga | i am truly delighted to meet you all students of the vietnamjapan university vju which is a symbol of human resource development projects between ... |


In [14]:
cleaned.speeches['Merkel']

'ladies and gentlemen  nearly  years ago walter hallstein the former german commission president referred to european integration as an endeavour of unparalleled boldness the year was  ten years before the first direct european elections took place at the time the european community was still in its infancy many of our greatest accomplishments – schengen the single market and a single currency – were still mere visions but the foundation had been laid and with that after centuries marked by wars and dictatorships the citizens of europe for the first time held out hope for a sustainable peaceful and stable future  today i am pleased and grateful to be standing in front of the largest democratic parliament in the world together you –  members elected in  member states – represent more than  million people that is nearly seven percent of the worlds population in your house we can feel the heart of european democracy beating the debates which are held in  languages are a sign of this your 
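
The Merkel output above still contains en dashes (–) and runs of double spaces, because `string.punctuation` only covers ASCII characters. If those matter for later steps, one possible follow-up pass is sketched below. The function name `tidy_whitespace` and the choice of which dashes to strip are assumptions, not part of the notebook.

```python
import re


def tidy_whitespace(text):
    """Hypothetical extra pass: drop en/em dashes, then collapse whitespace."""
    # Replace the en dash (U+2013) and em dash (U+2014) with a space.
    text = re.sub("[\u2013\u2014]", " ", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

It could be applied after the main cleaning function, for example with `cleaned.speeches.apply(tidy_whitespace)`.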

In [15]:
# Save the cleaned data to a CSV file for later analysis
cleaned.to_csv("corpus.csv")

### Word Matrix of Speeches

To be able to analyse the speeches, in this step I want to tokenize all the words.

We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to a text.
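
To make the row-per-document, column-per-word layout concrete, here is a small standard-library sketch that builds the same kind of count matrix by hand. It is not how `CountVectorizer` is implemented; the two sample documents and the tiny stop-word set are made up for illustration.

```python
from collections import Counter

# Toy corpus and stop-word list (both assumptions for illustration only).
corpus = {
    "Leader_A": "the people and the future of the people",
    "Leader_B": "a future for trade and the economy",
}
stop_words = {"a", "and", "for", "of", "the"}

# Count the non-stop words in each document.
counts = {
    name: Counter(w for w in text.split() if w not in stop_words)
    for name, text in corpus.items()
}

# The vocabulary (the matrix columns) is the sorted set of all remaining words.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word.
matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

With its default settings, `CountVectorizer` additionally lowercases the text and ignores one-character tokens, so its vocabulary can differ slightly from a plain whitespace split like this one.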

In [17]:
# We are going to create a word matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(cleaned.speeches)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_words = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_words.index = cleaned.index
data_words

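| | abandoned | abdullah | abe | ability | abject | abkhazians | able | abm | aboard | abolish | ... | zelensky | zero | zone | zones | àtokio | ça | économique | ëapplied | ìgo | şah |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Erdogan | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Macron | 0 | 1 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | ... | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Merkel | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Nicolas_Maduro | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Obama | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Putin | 0 | 0 | 0 | 3 | 0 | 0 | 8 | 3 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| Trudeau | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 |
| Trump | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Xi_Jinping | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Yoshihide_Suga | 0 | 0 | 3 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |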


In [18]:
# Save the word matrix to a CSV file as well
data_words.to_csv("data_words.csv")