# Data Story: COP26 Statements

#### The breakdown of the data story:

For this assignment, I am analyzing statements made by country representatives at COP26. With this data, I want to see what the overall sentiment was and how often statements referenced the major goals of COP26. For example, I want to see how many times fossil fuels are mentioned in each category (Global North countries, Global South countries, and the final report), especially since this was a touchy subject at the end of COP26. Additionally, I want to see (using the bottom-up method) which themes or topics are addressed most often (e.g., energy). By looking at this, I think I'll be able to get a good idea of what the climate priorities are for each category.

When I collected the data, I categorized the statements into two categories: Global North and Global South. The statements analyzed are only the ones that were in English and had an accessible PDF file on this website: https://unfccc.int/cop26/speeches-and-statements. The final report was collected here: https://www.bbc.com/news/science-environment-58982445

For total transparency, I downloaded each individual statement and then combined them into larger PDF files based on the categories mentioned above. From there, I converted the three PDF files into CSV files using this website: https://cdkm.com/pdf-to-csv

In [None]:
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import numpy as np
from bs4 import BeautifulSoup

In [None]:
import wordcloud

In [None]:
import nltk
from nltk import FreqDist

In [None]:
import re

## Analyzing Statements from the Global North

#### Cleaning Text
This section takes statements made by representatives from the Global North and cleans the text by removing the stopwords and punctuation.

In [None]:
df = pd.read_csv('GN_Statements_EN.csv', header=None)
#this is reading the csv file that contains statements from representatives in the global north

In [None]:
df.columns = ['Text']
#for the aesthetic and naming the column

In [None]:
df

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

from nltk.tokenize import word_tokenize
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)

#this code comes from Vlad and is the start of removing stopwords from the Global North csv file

In [None]:
df

In [None]:
remove_stopwords = set(stopwords.words('english'))
#compare in lowercase so capitalized stopwords such as "The" and "We" are removed too
df['tokenized'] = df['tokenized'].apply(lambda x: [item for item in x if item.lower() not in remove_stopwords])

#this code also comes from Vlad

In [None]:
df['tokenized']

In [None]:
df['tokenized'] = df['tokenized'].apply(lambda x: ' '.join(x))

In [None]:
df['tokenized']

In [None]:
df

In [None]:
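df['GN_Text_Clean'] = df['tokenized'].str.replace(r"[^a-zA-Z]+", " ", regex=True).str.strip()
#This regular expression came from Ramsha and is used to remove punctuation from df['tokenized']
#regex=True is needed in recent pandas versions, where str.replace treats the pattern as literal text by default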

In [None]:
df['GN_Text_Clean']

In [None]:
df.to_csv('GN_Statements_Clean.csv')
#Got this idea from Claudia

In [None]:
GN = pd.read_csv('GN_Statements_Clean.csv', header=0)

In [None]:
GN

In [None]:
GN["GN_Text_Clean"] = GN["GN_Text_Clean"].astype(str)
#Got this from Claudia, here I'm converting the values in column GN_Text_Clean from floats to strings

In [None]:
GN = pd.read_csv('GN_Statements_Clean.csv')

In [None]:
GN

In [None]:
GN.isnull().sum()

In [None]:
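GN1 = GN.dropna().copy()
#.copy() makes GN1 an independent DataFrame, so adding or dropping columns later does not raise SettingWithCopyWarning

In [None]:
GN1

In [None]:
GN1 = GN1.drop(columns=['Unnamed: 0'])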

In [None]:
GN1
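
The cleaning steps above (drop stopwords, strip punctuation) can also be sketched as one small function. This is only an illustration under stated assumptions: `clean_statement` and the tiny `SAMPLE_STOPWORDS` set are hypothetical stand-ins for the NLTK tokenizer and stopword list used in this notebook, and a plain whitespace split replaces `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's full English list
SAMPLE_STOPWORDS = {"the", "to", "we", "and", "of", "a", "is", "our"}

def clean_statement(text: str) -> str:
    """Keep letters only, then drop stopwords (case-insensitive)."""
    # Replace every run of non-letters (punctuation, digits) with a single space
    letters_only = re.sub(r"[^a-zA-Z]+", " ", text)
    # Split on whitespace and drop stopwords, comparing in lowercase
    kept = [w for w in letters_only.split() if w.lower() not in SAMPLE_STOPWORDS]
    return " ".join(kept)

cleaned = clean_statement("We must phase out coal, and protect our communities!")
print(cleaned)
```

Keeping the steps in one function would make it easier to apply exactly the same cleaning to the Global North, Global South, and final report files.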

#### Themes
This section takes statements made by representatives from the Global North to see which words are mentioned most often, in order to gauge the major themes (e.g., energy).

In [None]:
GN_Themes = Counter(" ".join(GN1.GN_Text_Clean).split())

In [None]:
GN_Themes.most_common(20)

In [None]:
#generate_from_frequencies uses the word counts, so more frequent words are drawn larger
mywordcloud = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(GN_Themes)
plt.imshow(mywordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Global North Statements')
plt.savefig('GN.png')

#### Frequency
This section takes statements made by representatives from the Global North to see the frequency of words from the four main goals of COP26: a global net zero, protecting communities, mobilizing finance, and working together. The themes were picked based on this article: https://www.thenationalnews.com/world/uk-news/2021/07/07/key-themes-for-cop26-climate-summit-unveiled/

In the following section, I got my code from http://www.learningaboutelectronics.com/Articles/How-to-find-the-number-of-times-a-word-or-phrase-occurs-in-a-text-in-Python-using-regular-expressions.php

In [None]:
with open('GN_Statements_EN.csv', 'r') as file:
    read_data = file.read()
per_word = read_data.split()
print('Total Words: ', len(per_word))

# Using this to know how many words are in the document, which will become necessary for visualizing the frequency of
# the specific words below; the original document is used here to get the original word count

In [None]:
phrase = list(GN1["GN_Text_Clean"])
makeitastring = ' '.join(map(str, phrase))
#here I'm making the column GN_TexT_Clean from GN into a string for regular expressions

In [None]:
makeitastring

### Global Net Zero
First, looking at the goal of reaching a global net zero, I searched for terms (fossil fuels, emissions, renewables, and coal) to see how often the Global North statements fell into this category.

In [None]:
patterns = [r'fossil fuels[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)
#using regular expression, I am able to see how often fossil fuels is mentioned
#this was then used for other variations of fossil fuel, fossil, fuels and fuel
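
Because the same search is repeated for many terms below, it could be wrapped in a small helper. This is a sketch, not the code used for the counts in this notebook: `count_term` is a hypothetical name, and it uses word boundaries (`\b`) instead of the trailing `[^a-z,A-Z]+`. With word boundaries, a term at the very end of the text is still counted, and longer words that merely end with the term (for example "biofuel" when searching for "fuel") are not.

```python
import re

def count_term(term: str, text: str) -> int:
    """Count whole-word (or whole-phrase) occurrences of term in text."""
    # re.escape guards against special characters; \b marks a word boundary
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text))

# Made-up sample text standing in for makeitastring
sample_text = "fossil fuels biofuel fuel subsidies end fossil fuel"
for term in ["fossil fuels", "fossil fuel", "fuel"]:
    print(term, count_term(term, sample_text))
```

Note that a search for "fossil" alone also counts every "fossil fuel" and "fossil fuels", so the counts of overlapping terms should not simply be added together.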

In [None]:
patterns = [r'fossil fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'fossil[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there was no result for 'fossils'

In [None]:
patterns = [r'fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'fuels[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'emissions[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'emission[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'renewables[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'renewable[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'coal[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

### Protecting Communities
Next, looking at protecting communities, I searched for terms (protect and restore) to see how often the Global North statements fell into this category. 

In [None]:
patterns = [r'protect[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'protecting[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'protects[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'restore[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#also searched for restores, restored and restoring, but they were not found in the statements

### Mobilize Finance
Next, looking into mobilizing finance, I searched for the term "finance" to see how often the Global North statements fell into this category.

In [None]:
patterns = [r'finance[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'finances[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'financing[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

### Working Together
Lastly, looking into working together, I searched for terms (together & collaboration) to see how often the Global North statements fell into this category.

In [None]:
patterns = [r'together[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaboration[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaborate[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaborates[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)
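
To compare categories of different lengths, the per-term counts can be summed per goal and expressed per 1,000 words, using the total word count computed earlier. The sketch below assumes a hypothetical `GOAL_TERMS` dictionary that mirrors the search terms used in this notebook and a made-up sample string; it is not the calculation behind the final chart.

```python
import re

# Search terms per COP26 goal, mirroring the terms searched for in this notebook.
# Longer phrases come first so "fossil fuels" is counted once, not as "fossil" plus "fuels".
GOAL_TERMS = {
    "Global Net Zero": ["fossil fuels", "fossil fuel", "fossil", "fuels", "fuel",
                        "emissions", "emission", "renewables", "renewable", "coal"],
    "Protecting Communities": ["protecting", "protects", "protect", "restore"],
    "Mobilize Finance": ["finances", "finance", "financing"],
    "Working Together": ["together", "collaboration", "collaborates", "collaborate"],
}

def goal_counts(text: str) -> dict:
    """Sum whole-word matches of every term, grouped by goal."""
    counts = {}
    for goal, terms in GOAL_TERMS.items():
        # One alternation per goal; matches do not overlap, so nothing is double counted
        pattern = r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b"
        counts[goal] = len(re.findall(pattern, text))
    return counts

def per_thousand(counts: dict, total_words: int) -> dict:
    """Express raw counts as mentions per 1,000 words."""
    return {goal: round(1000 * n / total_words, 2) for goal, n in counts.items()}

# Made-up sample text standing in for makeitastring
sample_text = "together we must cut emissions end coal and scale up finance"
counts = goal_counts(sample_text)
rates = per_thousand(counts, total_words=len(sample_text.split()))
print(counts)
print(rates)
```

Normalizing by document length matters here because the three files are unlikely to contain the same number of words.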

### Using Frequency Distribution

Here, I wanted to run a frequency distribution in order to plot the most frequently used terms. I thought this could have been an interesting visualization for the different goals of COP26. However, I had problems with the code and could not figure out the issue without help.

In [None]:
#Statements = GN1['GN_Text_Clean']
#Statements

#words: list[str] = nltk.word_tokenize('Statements')
#fd = nltk.FreqDist(words)
#frequency distribution comes from this website: https://python.gotrained.com/frequency-distribution-in-nltk/

#fdist = nltk.FreqDist('GN_Clean_Text')
#freqDist = FreqDist('GN_Clean_Text')
#print(freqDist)

#words = freqDist.keys()
#print(type(words))

#words

#print(len(words))

#freDist.plot(11)
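
One possible reason the attempt above did not work: `FreqDist('GN_Clean_Text')` receives a literal string, so it counts the characters of that name rather than the words in the column (and the column is actually called `GN_Text_Clean`). Since `nltk.FreqDist` is a subclass of `collections.Counter`, the difference can be shown with `Counter` alone. The sketch below uses a made-up list of cleaned rows in place of `GN1['GN_Text_Clean']`.

```python
from collections import Counter

# Stand-in for list(GN1['GN_Text_Clean']); these rows are made up for illustration
clean_rows = ["climate finance adaptation", "climate action", "finance climate"]

# Passing a string counts its characters, not words
char_counts = Counter("GN_Clean_Text")
print(char_counts.most_common(3))

# Passing a list of words counts the words themselves
words = " ".join(clean_rows).split()
word_counts = Counter(words)
print(word_counts.most_common(2))
```

With NLTK, the same list of words could presumably be passed as `FreqDist(words)` and plotted with its `plot` method.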

#### Sentiment
This section takes statements made by representatives from the Global North to analyze the sentiment behind references to fuels and fossil fuels. In the following section, I got my Python code from https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6 and https://realpython.com/python-nltk-sentiment-analysis/

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")
#code comes from Freddy

In [None]:
statements =GN1['Text'].to_list()

In [None]:
from random import shuffle

def is_positive(statement: str) -> bool:
    """True if statement has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(statement)["compound"] > 0

shuffle(statements)
for statement in statements[:10]:
    print(">", is_positive(statement), statement)

In [None]:
GN1['sentiment'] = GN1['GN_Text_Clean'].apply(lambda x: sia.polarity_scores(x)['compound'])
#code comes from Vlad

In [None]:
sia.polarity_scores("random text")['compound']

In [None]:
GN1

In [None]:
GN1.to_csv('GN_Statements_Sentiment.csv')

In [None]:
negative = GN1['sentiment'] <= -0.25
neutral = (GN1['sentiment'] < 0.25) & (GN1['sentiment'] > -0.25)
positive = GN1['sentiment'] >= 0.25
#Code from Wei

In [None]:
GN1[neutral].count()

In [None]:
GN1[positive].count()

In [None]:
GN1[negative].count()
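
The three masks above can also be turned into a single label per row, which makes it easy to get the shares used for a pie chart. This sketch uses plain Python and made-up compound scores; `label_sentiment` is a hypothetical helper, and the 0.25 and -0.25 cutoffs are the same ones used in the cell above.

```python
from collections import Counter

def label_sentiment(compound: float, cutoff: float = 0.25) -> str:
    """Map a compound score in [-1, 1] to a coarse label."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"

# Made-up compound scores standing in for GN1['sentiment']
sample_scores = [0.6, 0.0, -0.4, 0.25, 0.1]
labels = [label_sentiment(s) for s in sample_scores]
label_counts = Counter(labels)
# Share of rows per label, in percent
shares = {label: 100 * n / len(labels) for label, n in label_counts.items()}
print(shares)
```

On the real data, the equivalent would be applying the helper to the `sentiment` column and counting the resulting labels.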

## Analyzing Statements from the Global South

#### Cleaning Text
This section takes statements made by representatives from the Global South and cleans the text by removing the stopwords and punctuation.

In [None]:
df = pd.read_csv('GS_Statements_EN.csv', header=None)
#this is reading the csv file that contains statements from representatives in the global south

In [None]:
df.columns = ['Text']
#for the aesthetic and naming the column

In [None]:
df

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

from nltk.tokenize import word_tokenize
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)

#this code comes from Vlad and is the start of removing stopwords from the Global South csv file

In [None]:
df

In [None]:
remove_stopwords = set(stopwords.words('english'))
#compare in lowercase so capitalized stopwords such as "The" and "We" are removed too
df['tokenized'] = df['tokenized'].apply(lambda x: [item for item in x if item.lower() not in remove_stopwords])

#this code also comes from Vlad

In [None]:
df['tokenized']

In [None]:
df['tokenized'] = df['tokenized'].apply(lambda x: ' '.join(x))

In [None]:
df['tokenized']

In [None]:
df

In [None]:
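df['GS_Text_Clean'] = df['tokenized'].str.replace(r"[^a-zA-Z]+", " ", regex=True).str.strip()
#This regular expression came from Ramsha and is used to remove punctuation from df['tokenized']
#regex=True is needed in recent pandas versions, where str.replace treats the pattern as literal text by default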

In [None]:
df['GS_Text_Clean']

In [None]:
df.to_csv('GS_Statements_Clean.csv')
#Got this idea from Claudia

In [None]:
GS = pd.read_csv('GS_Statements_Clean.csv', header=0)

In [None]:
GS

In [None]:
GS["GS_Text_Clean"] = GS["GS_Text_Clean"].astype(str)
#Got this from Claudia, here I'm converting the values in column GS_Text_Clean from floats to strings

In [None]:
GS = pd.read_csv('GS_Statements_Clean.csv')

In [None]:
GS

In [None]:
GS.isnull().sum()

In [None]:
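GS1 = GS.dropna().copy()
#.copy() makes GS1 an independent DataFrame, so adding or dropping columns later does not raise SettingWithCopyWarning

In [None]:
GS1

In [None]:
GS1 = GS1.drop(columns=['Unnamed: 0'])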

In [None]:
GS1

#### Themes
This section takes statements made by representatives from the Global South to see which words are mentioned most often, in order to gauge the major themes (e.g., energy).

In [None]:
GS_Themes = Counter(" ".join(GS1.GS_Text_Clean).split())

In [None]:
GS_Themes.most_common(20)

In [None]:
#generate_from_frequencies uses the word counts, so more frequent words are drawn larger
mywordcloud = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(GS_Themes)
plt.imshow(mywordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Global South Statements')
plt.savefig('GS.png')

#### Frequency
This section takes statements made by representatives from the Global South to see the frequency of words from the four main goals of COP26: a global net zero, protecting communities, mobilizing finance, and working together. The themes were picked based on this article: https://www.thenationalnews.com/world/uk-news/2021/07/07/key-themes-for-cop26-climate-summit-unveiled/

In the following section, I got my code from http://www.learningaboutelectronics.com/Articles/How-to-find-the-number-of-times-a-word-or-phrase-occurs-in-a-text-in-Python-using-regular-expressions.php

In [None]:
with open('GS_Statements_EN.csv', 'r') as file:
    read_data = file.read()
per_word = read_data.split()
print('Total Words: ', len(per_word))

# Using this to know how many words are in the document, which will become necessary for visualizing the frequency of
# the specific words below; the original document is used here to get the original word count

In [None]:
phrase = list(GS1["GS_Text_Clean"])
makeitastring = ' '.join(map(str, phrase))
#here I'm making the column GS_TexT_Clean from GS into a string for regular expressions

In [None]:
makeitastring

### Global Net Zero
First, looking at the goal of reaching a global net zero, I searched for terms (fossil fuels, emissions, renewables, and coal) to see how often the Global South statements fell into this category.

In [None]:
patterns = [r'fossil fuels[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)
#using regular expression, I am able to see how often fossil fuels is mentioned
#this was then used for other variations of fossil fuel, fossil, fuels and fuel

In [None]:
patterns = [r'fossil fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'fossil[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#there were no results for fossils

In [None]:
patterns = [r'fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'fuels[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'emissions[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'emission[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'renewables[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'renewable[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'coal[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

### Protecting Communities
Next, looking at protecting communities, I searched for terms (protect and restore) to see how often the Global South statements fell into this category. 

In [None]:
patterns = [r'protect[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'protecting[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#there were no results for 'protects'

In [None]:
patterns = [r'restore[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#there were no results for 'restores' or 'restoring'

In [None]:
patterns = [r'restored[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

### Mobilize Finance
Next, looking into mobilizing finance, I searched for the term "finance" to see how often the Global South statements fell into this category.

In [None]:
patterns = [r'finance[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'finances[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'financing[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

### Working Together
Lastly, looking into working together, I searched for terms (together & collaboration) to see how often the Global South statements fell into this category.

In [None]:
patterns = [r'together[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaboration[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaborate[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaborates[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#### Sentiment
This section takes statements made by representatives from the Global South to analyze the sentiment behind references to fuels and fossil fuels. In the following section, I got my Python code from https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6 and https://realpython.com/python-nltk-sentiment-analysis/

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")
#code comes from Freddy

In [None]:
statements =GS1['Text'].to_list()

In [None]:
from random import shuffle

def is_positive(statement: str) -> bool:
    """True if statement has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(statement)["compound"] > 0

shuffle(statements)
for statement in statements[:10]:
    print(">", is_positive(statement), statement)

In [None]:
GS1['sentiment'] = GS1['GS_Text_Clean'].apply(lambda x: sia.polarity_scores(x)['compound'])
#code comes from Vlad

In [None]:
sia.polarity_scores("random text")['compound']

In [None]:
GS1

In [None]:
GS1.to_csv('GS_Statements_Sentiment.csv')

In [None]:
negative = GS1['sentiment'] <= -0.25
neutral = (GS1['sentiment'] < 0.25) & (GS1['sentiment'] > -0.25)
positive = GS1['sentiment'] >= 0.25
#Code from Wei

In [None]:
GS1[neutral].count()

In [None]:
GS1[positive].count()

In [None]:
GS1[negative].count()

## Analyzing the Final Report from COP26

#### Cleaning Text
This section takes the final report from COP26 and cleans the text by removing the stopwords and punctuation.

In [None]:
df = pd.read_csv('COP26_Final_Report.csv', header=None)
#this is reading the csv file that contains the final report from COP26

In [None]:
df.columns = ['Text']
#for the aesthetic and naming the column

In [None]:
df

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

from nltk.tokenize import word_tokenize
df['tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)

#this code comes from Vlad and is the start of removing stopwords from the final report csv file

In [None]:
df

In [None]:
remove_stopwords = set(stopwords.words('english'))
#compare in lowercase so capitalized stopwords such as "The" and "We" are removed too
df['tokenized'] = df['tokenized'].apply(lambda x: [item for item in x if item.lower() not in remove_stopwords])

#this code also comes from Vlad

In [None]:
df['tokenized']

In [None]:
df['tokenized'] = df['tokenized'].apply(lambda x: ' '.join(x))

In [None]:
df['tokenized']

In [None]:
df

In [None]:
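df['FR_Text_Clean'] = df['tokenized'].str.replace(r"[^a-zA-Z]+", " ", regex=True).str.strip()
#This regular expression came from Ramsha and is used to remove punctuation from df['tokenized']
#regex=True is needed in recent pandas versions, where str.replace treats the pattern as literal text by default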

In [None]:
df['FR_Text_Clean']

In [None]:
df.to_csv('FR_Statements_Clean.csv')
#Got this idea from Claudia

In [None]:
FR = pd.read_csv('FR_Statements_Clean.csv', header=0)

In [None]:
FR

In [None]:
FR["FR_Text_Clean"] = FR["FR_Text_Clean"].astype(str)
#Got this from Claudia, here I'm converting the values in column FR_Text_Clean from floats to strings

In [None]:
FR = pd.read_csv('FR_Statements_Clean.csv')

In [None]:
FR

In [None]:
FR.isnull().sum()

In [None]:
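FR1 = FR.dropna().copy()
#.copy() makes FR1 an independent DataFrame, so adding or dropping columns later does not raise SettingWithCopyWarning

In [None]:
FR1

In [None]:
FR1 = FR1.drop(columns=['Unnamed: 0'])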

In [None]:
FR1

#### Themes
This section takes the final report to see which words are mentioned most often, in order to gauge the major themes (e.g., energy).

In [None]:
FR_Themes = Counter(" ".join(FR1.FR_Text_Clean).split())

In [None]:
FR_Themes.most_common(20)

In [None]:
#generate_from_frequencies uses the word counts, so more frequent words are drawn larger
mywordcloud = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(FR_Themes)
plt.imshow(mywordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Final Report')
plt.savefig('FR.png')

#### Frequency
This section takes the final report to see the frequency of words from the four main goals of COP26: a global net zero, protecting communities, mobilizing finance, and working together. The themes were picked based on this article: https://www.thenationalnews.com/world/uk-news/2021/07/07/key-themes-for-cop26-climate-summit-unveiled/

In the following section, I got my code from http://www.learningaboutelectronics.com/Articles/How-to-find-the-number-of-times-a-word-or-phrase-occurs-in-a-text-in-Python-using-regular-expressions.php

In [None]:
with open('COP26_Final_Report.csv', 'r') as file:
    read_data = file.read()
per_word = read_data.split()
print('Total Words: ', len(per_word))

# Using this to know how many words are in the document, which will become necessary for visualizing the frequency of
# the specific words below; the original document is used here to get the original word count

In [None]:
phrase = list(FR1["FR_Text_Clean"])
makeitastring = ' '.join(map(str, phrase))
#here I'm making the column FR_Text_Clean from FR1 into a string for regular expressions

In [None]:
makeitastring

### Global Net Zero
First, looking at the goal of reaching a global net zero, I searched for terms (fossil fuels, emissions, renewables, and coal) to see how often the final report falls into this category.

In [None]:
patterns = [r'fossil fuels[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)
#using regular expression, I am able to see how often fossil fuels is mentioned
#this was then used for other variations of fossil fuel, fossil, fuels and fuel

In [None]:
patterns = [r'fossil fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'fossil[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there was no result for fossils

In [None]:
patterns = [r'fuel[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

#There was no result for fuels

In [None]:
patterns = [r'emissions[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'emission[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'coal[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there were no results for 'renewables' or 'renewable'

### Protecting Communities
Next, looking at protecting communities, I searched for terms (protect and restore) to see how often the final report falls into this category. 

In [None]:
patterns = [r'protecting[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there were no results for 'protect' or 'protects'

In [None]:
patterns = [r'restoring[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there were no results for 'restore', 'restores' or 'restored'

### Mobilize Finance
Next, looking into mobilizing finance, I searched for the term "finance" to see how often the final report falls into this category.

In [None]:
patterns = [r'finance[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there were no results for 'finances' or 'financing'

### Working Together
Lastly, looking into working together, I searched for terms (together & collaboration) to see how often the final report falls into this category.

In [None]:
patterns = [r'together[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

In [None]:
patterns = [r'collaboration[^a-z,A-Z]+']

for p in patterns:
    match = re.findall(p, makeitastring)
    print(match)
    
length= len(match)
print(length)

# there were no results for 'collaborate' or 'collaborates'

#### Sentiment
This section takes the final report to analyze the sentiment behind references to fuels and fossil fuels. In the following section, I got my Python code from https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6 and https://realpython.com/python-nltk-sentiment-analysis/

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")
#code comes from Freddy

In [None]:
statements =FR1['Text'].to_list()

In [None]:
from random import shuffle

def is_positive(statement: str) -> bool:
    """True if statement has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(statement)["compound"] > 0

shuffle(statements)
for statement in statements[:10]:
    print(">", is_positive(statement), statement)

In [None]:
FR1['sentiment'] = FR1['FR_Text_Clean'].apply(lambda x: sia.polarity_scores(x)['compound'])
#code comes from Vlad

In [None]:
sia.polarity_scores("random text")['compound']

In [None]:
FR1

In [None]:
FR1.to_csv('FR_Statements_Sentiment.csv')

In [None]:
negative = FR1['sentiment'] <= -0.25
neutral = (FR1['sentiment'] < 0.25) & (FR1['sentiment'] > -0.25)
positive = FR1['sentiment'] >= 0.25
#Code from Wei

In [None]:
FR1[neutral].count()

In [None]:
FR1[positive].count()

In [None]:
FR1[negative].count()

## Visualizations

The visualizations I've decided to include in my story are the word clouds for each category: Global North, Global South, and the final report. I then created two visualizations using Datawrapper, because it looked nicer and let me compare the different datasets. First, I used pie charts to show what percentage of each document was negative, positive, or neutral. Second, I used a clustered bar chart to show the number of references made to each of the four goals.