# Step 1:

## Data Loading, Processing and Plotting

We start by importing the packages we are going to use. Each package is a tool within the Python ecosystem that someone has created, which allows us to do something new (such as image analysis) without coding it ourselves.

In the next example, we are going to load a CSV file (which can also be opened in Excel) and analyse it, so we will need the following packages: pandas and NumPy.

pandas is our dataframe package; NumPy is our numerical Python package.

In [None]:
import pandas
import numpy

Example 1: We are going to load a dataframe into Python and display it

In [None]:
# Load the csv file

df = pandas.read_csv("Data/Example.csv")

# Show first two rows (Notice the 2)
df.head(2)

# Notice that we didn't create a variable for this operation (displaying the first two rows), so it doesn't exist in our variable explorer.

Examine the data by column:

In [None]:
df.info()

Count the Number of Missing Values in each column:

In [None]:
df.isnull().sum()
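To see what this count is doing, here is a minimal sketch on a tiny, made-up CSV (the contents below are invented for illustration and are not the real Data/Example.csv). An empty cell in the file is read in as a missing value, and `isnull().sum()` counts those per column.

```python
import io
import pandas

# Hypothetical CSV contents, standing in for a real file on disk
csv_text = """Country,Population (Millions)
Turkey,84.34
Iran,
Iraq,40.22
"""

# read_csv accepts a file-like object as well as a file path
sample = pandas.read_csv(io.StringIO(csv_text))

# isnull() marks every missing cell as True; sum() then counts the Trues per column
missing_per_column = sample.isnull().sum()

print(missing_per_column)
```

The population for Iran was left blank in the text, so that column reports one missing value and the Country column reports none.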

In [None]:
# Let's look at variable assignment and displaying the dataframe.

LastTwoRows = df.tail(2)

# This doesn't print as we are assigning a variable. But we can display the variable this way:

print(LastTwoRows)

# Or

LastTwoRows

# Notice the difference in Display

Example 2: We are going to filter a dataframe

In [None]:
# We only want data for Turkey

TurkeyRow = df.loc[df['Country'] == "Turkey"]

# Here we use .loc and filter by the column name 'Country'

TurkeyRow
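It can help to split the filter above into its two steps. The sketch below uses a small, made-up dataframe (the population figures are invented for illustration): the comparison first produces a column of True/False values, and `.loc` then keeps only the rows marked True.

```python
import pandas

# Made-up data for illustration only
sample = pandas.DataFrame({
    "Country": ["Turkey", "Iran", "Iraq"],
    "Population (Millions)": [84.34, 83.99, 40.22],
})

# Step 1: the comparison gives one True/False value per row
mask = sample["Country"] == "Turkey"
print(mask)

# Step 2: .loc keeps only the rows where the mask is True
turkey_row = sample.loc[mask]
print(turkey_row)
```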

### Lambda Functions

Lambda functions are functions without a name. They are typically used for short, row-wise logic such as filtering or transforming values. Here we will round the Population column row by row.

In [None]:
# Note that we select the column with square brackets and quotation marks here, rather than with a dot as in df.Country. We have to do this whenever there is a space in the column name.

RoundedCountryPopulation = df['Population (Millions)'].apply(lambda x: round(x,1))

RoundedCountryPopulation

Note that in Python there are often many ways of doing the same thing. Here it is quicker to round the whole column in one go than to round it row by row. As the workload is so small we won't see a difference, unless your computer is quite old:

In [None]:
RoundedCountryPopulation = round(df['Population (Millions)'], 1)

RoundedCountryPopulation
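As a quick check that the two approaches agree, the sketch below rounds a small, made-up column both ways and compares the results. It uses the `.round()` method of a pandas Series, which is another way of rounding the whole column at once.

```python
import numpy
import pandas

# Made-up values for illustration only
population = pandas.Series([84.34, 9.96, 43.53], name="Population (Millions)")

# Row by row: the lambda is called once for every value
row_by_row = population.apply(lambda x: round(x, 1))

# Whole column at once: pandas does the rounding in a single operation
whole_column = population.round(1)

# Compare the two results element by element
same = numpy.allclose(row_by_row, whole_column)
print(same)
```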

### Loops

Like lambda functions, loops in Python allow us to iterate over data and repeat an operation on it. Here, we will only round the population if the country name is Turkey:

In [None]:
# We copy the data so we don't affect the original data
RoundedCountryPopulation = df.copy()

for i in range(0, len(RoundedCountryPopulation)):

    x = RoundedCountryPopulation.Country.loc[i]

    if x == 'Turkey':

        # Print Turkey if the Country is Turkey
        print(x)

        # Round the Value for Turkey's population
        RoundedCountryPopulation.loc[i, 'Population (Millions)'] = round(RoundedCountryPopulation.loc[i, 'Population (Millions)'], 0)

    else:

        # Do nothing for any other country

        pass

print('Population of Turkey has been Rounded to the nearest whole number')

RoundedCountryPopulation

Notice that if we write this as a lambda function instead, it's much more concise (this version returns just the population column rather than the whole dataframe):

In [None]:
RoundedCountryPopulation = df[['Country', 'Population (Millions)']].apply(lambda x: round(x['Population (Millions)'],0) if x.Country == 'Turkey' else x['Population (Millions)'], axis=1)

RoundedCountryPopulation
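A third option, sketched below on made-up data, is to skip both the loop and the lambda and use a True/False mask with `.loc`. This is one possible alternative rather than the only correct way; the population figures are invented for illustration.

```python
import pandas

# Made-up data for illustration only
sample = pandas.DataFrame({
    "Country": ["Turkey", "Iran", "Iraq"],
    "Population (Millions)": [84.34, 83.99, 40.22],
})

# Work on a copy so the original data is not affected
rounded = sample.copy()

# True for the rows we want to change, False everywhere else
is_turkey = rounded["Country"] == "Turkey"

# Round only the selected rows; all other rows are left untouched
rounded.loc[is_turkey, "Population (Millions)"] = (
    rounded.loc[is_turkey, "Population (Millions)"].round(0)
)

print(rounded)
```

Unlike the lambda version, this keeps the whole dataframe, so the Country column stays next to the rounded values.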

### Replacing Text

Sometimes you will need to replace text that is incorrect in the data. Here, we change Iran to Tunisia.

In [None]:
df.Country = df.Country.replace('Iran', 'Tunisia')

df

We can replace characters as well, row by row.

In [None]:
# Here we don't assign the result to a variable. We go row by row: the 'for' inside the square brackets is a loop (this is called a list comprehension).

[x.replace("a", '-') for x in df.Country]
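pandas can also do this character replacement on the whole column at once, through the `.str` accessor. The sketch below, on a made-up list of countries, compares the two. Note the difference from the earlier `replace('Iran', 'Tunisia')`: that one swaps whole cell values, while `.str.replace` works on characters inside each value.

```python
import pandas

# Made-up data for illustration only
countries = pandas.Series(["Turkey", "Iran", "Iraq"], name="Country")

# Row by row with a list comprehension: the result is a plain Python list
with_list = [x.replace("a", "-") for x in countries]

# Whole column at once: the result is still a pandas Series
# regex=False means "a" is treated as plain text, not as a pattern
with_str = countries.str.replace("a", "-", regex=False)

print(with_list)
print(with_str)
```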

Let's utilize these skills to graph data we've been using.

In [None]:
ForeignMinistryStatements = pandas.read_csv('Data/Turkish Foreign Ministry Statements - Press Releases - KRG.csv')

ForeignMinistryStatements.head(5)

We are going to graph the number of PKK mentions in MFA statements over time. Therefore we only need the date column and the 'Which group?' column.

In [None]:
ForeignMinistryStatements = ForeignMinistryStatements[['Date (dd/mm/yyyy)', 'Which group?']]

ForeignMinistryStatements.head(5)

The dates need to be in one common format. We parse the column twice, once for each of the two formats, then fill the missing values from the first parse with those from the second and use the result as the date column.

In [None]:
#ForeignMinistryStatements['Date (dd/mm/yyyy)'].replace('-', '/', inplace = True)

Date1 = pandas.to_datetime(ForeignMinistryStatements['Date (dd/mm/yyyy)'], errors='coerce', format='%Y-%m-%d')
Date2 = pandas.to_datetime(ForeignMinistryStatements['Date (dd/mm/yyyy)'], errors='coerce', format='%d/%m/%Y')

ForeignMinistryStatements['Date (dd/mm/yyyy)'] = Date1.fillna(Date2)

ForeignMinistryStatements.head(5)
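The sketch below shows the same two-pass idea on three made-up strings, so we can see what each step produces. With `errors='coerce'`, anything that does not match the given format becomes NaT (pandas' marker for a missing date) instead of raising an error.

```python
import pandas

# Made-up values for illustration: two different formats and one bad entry
raw = pandas.Series(["2019-12-25", "03/01/2020", "not a date"])

# First pass: only the year-month-day strings are parsed, the rest become NaT
iso = pandas.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# Second pass: only the day/month/year strings are parsed, the rest become NaT
dmy = pandas.to_datetime(raw, errors="coerce", format="%d/%m/%Y")

# Fill the gaps in the first pass with the values from the second pass
combined = iso.fillna(dmy)

print(combined)
```

The last entry matches neither format, so it stays missing. It is worth counting the remaining missing dates with `combined.isnull().sum()` before moving on.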

Now we only want the data for PKK mentions:

In [None]:
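# Keep only the rows tagged PKK, whichever of the three capitalisations was used.
# .copy() gives us an independent dataframe, which avoids a SettingWithCopyWarning when we modify it in the next cell.
ForeignMinistryStatements = ForeignMinistryStatements[ForeignMinistryStatements['Which group?'].isin(['PKK', 'Pkk', 'pkk'])].copy()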

ForeignMinistryStatements
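If we did not know in advance which capitalisations appear in the data, one option is to lower-case the column before comparing. The sketch below uses a made-up column (the other group name and the empty cell are invented for illustration). Be aware that this is slightly broader than listing spellings by hand, because it would also match variants such as 'pKK'.

```python
import pandas

# Made-up data for illustration only, including one empty cell
statements = pandas.DataFrame({
    "Which group?": ["PKK", "Pkk", "pkk", "ISIS", None],
})

# Lower-case every entry, then compare against a single spelling.
# Empty cells stay missing and therefore compare as False.
is_pkk = statements["Which group?"].str.lower() == "pkk"

pkk_only = statements[is_pkk]
print(pkk_only)
```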

There are 73 statements. How does this look by year?

In [None]:
ForeignMinistryStatements['Which group?'] = 1

ForeignMinistryStatements.set_index('Date (dd/mm/yyyy)', inplace=True, drop=True)

ForeignMinistryStatements = ForeignMinistryStatements.resample('Y').agg('sum')

ForeignMinistryStatements.head(5)
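Another way to count by year, sketched below on three made-up dates, is to group on the year of the index. This is an alternative to `resample`, not a replacement for it: `resample` also returns the years in between that have no statements (with a total of 0), whereas `groupby` only returns years that actually appear in the data.

```python
import pandas

# Made-up dates for illustration only; note there is nothing in 2018
dates = pandas.to_datetime(["2017-03-01", "2017-09-15", "2019-06-30"])

# One row per statement, each worth 1, with the date as the index
mentions = pandas.Series(1, index=dates, name="Mentions")

# Group the rows by the year of their date and add up the 1s
per_year = mentions.groupby(mentions.index.year).sum()

print(per_year)
```

For a line chart over time, the gap-filling behaviour of `resample` is usually what we want, since a year with no mentions should be drawn as zero rather than skipped.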

Let's plot this:

In [None]:
import matplotlib.pyplot as plt
from matplotlib import dates
import seaborn

plt.figure(figsize=(15,8))
ax = seaborn.lineplot(x="Date (dd/mm/yyyy)", y="Which group?", data=ForeignMinistryStatements)

ax.set(xlabel = 'Year', ylabel = 'PKK Mentions')

seaborn.set(font_scale = 1)

ax.set(xticks=ForeignMinistryStatements.index)
ax.xaxis.set_major_formatter(dates.DateFormatter("%Y"))

# Step 2:

## GeoSpatial Data and Plotting

Python is quite useful for graphing quantitative data as well as geospatial data.

In [None]:
# Let's start by importing some conflict data:

Conflict = pandas.read_csv("Data/IraqConflictData.csv")

Conflict.head(5)

We want to view the number of protests in Iraq on a map. Let's filter to protests by event type.

In [None]:
# Let's explore the different event types in this column before we filter:

Conflict["EVENT_TYPE"].unique()

# We only want Protest events.

In [None]:
# Maybe we'd want to examine the frequency of each event type:

Conflict["EVENT_TYPE"].value_counts()
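If shares are more useful than raw counts, `value_counts` can also return proportions. The sketch below uses a short, made-up list of event types (invented for illustration, not taken from the conflict data).

```python
import pandas

# Made-up event types for illustration only
event_types = pandas.Series(
    ["Protests", "Battles", "Protests", "Riots", "Protests"],
    name="EVENT_TYPE",
)

# Raw counts, most frequent first
counts = event_types.value_counts()

# normalize=True gives each type's share of all events (the shares add up to 1)
shares = event_types.value_counts(normalize=True)

print(counts)
print(shares)
```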

In [None]:
Conflict = Conflict[Conflict["EVENT_TYPE"] == 'Protests']

Conflict.head(5)

In order to plot these on a map, we need to import the GeoPandas package, a geospatial Python package that extends pandas functionality to geospatial data.

In [None]:
import geopandas

Let's also import the boundaries for Iraq's governorates, so we can group protest data by governorate. Notice that we add a new column, 'Governorate Number'. We will use this as the ID number for the governorates, as the data did not come with governorate names.

In [None]:
# We got this data from: https://data.humdata.org/dataset/cod-ab-irq?

# The UN and similar organisations publish lots of data, particularly for Iraq. See what you can find.

Iraq = geopandas.read_file("Map/iraq_governates.shp")

Iraq['Governorate Number'] = numpy.arange(len(Iraq))

Iraq.head(10)

Let's plot this:

In [None]:
ax = Iraq.plot(figsize = (10,10), facecolor = 'none')
ax.axis('off')

Now that we have the different governorates, we can assign data to them using geopandas functionality:

In [None]:
Conflict.columns

In [None]:
# First, we must ensure they use the same coordinate system. In both cases the data uses WGS 84 so we are fine.

# We are going to join the two datasets. We are going to join them as geodataframes, so we have to convert the conflict data into a geodataframe.

# For the Conflict Data, we do not need all the columns either.

ConflictGeoDataFrame = geopandas.GeoDataFrame(Conflict[['EVENT_DATE', 'YEAR', 'ACTOR1', 'ACTOR2', 'ADMIN3', 'LOCATION', 'LONGITUDE', 'LATITUDE','FATALITIES', 'NOTES']], geometry=geopandas.points_from_xy(Conflict['LONGITUDE'], Conflict['LATITUDE']))

ConflictGeoDataFrame.head(5)

#Iraq.sjoin(Conflict, how="left")

Notice that when we convert the dataframe to a geodataframe using longitude and latitude, a new column called geometry is created. We can drop the longitude and latitude columns now.

In [None]:
ConflictGeoDataFrame.drop(columns=['LONGITUDE', 'LATITUDE'], inplace=True)

ConflictGeoDataFrame.head(1)

Now we can focus on joining the data, to get number of protests by governorate:

In [None]:
Protests = Iraq.sjoin(ConflictGeoDataFrame, how = 'inner')

Protests.head(5)

We can plot this as a choropleth map, with colour denoting the number of protests by governorate over the entire time period. To do this, we simply add a new column equal to 1 for every protest, which we can then sum within each governorate by grouping on 'Governorate Number'.

In [None]:
Protests['Number of Protests'] = 1

Protests = Protests[['geometry', 'Governorate Number', 'Number of Protests']]

# Sum the 1s within each governorate to get one protest count per governorate:
IraqProtests = Protests.groupby('Governorate Number')['Number of Protests'].agg('sum')

IraqProtests
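Adding a column of 1s and summing it is one way of counting rows per group. The sketch below, on made-up governorate numbers, shows that `groupby(...).size()` gives the same counts without needing the extra column.

```python
import pandas

# Made-up data for illustration: one row per protest, tagged with its governorate
joined = pandas.DataFrame({"Governorate Number": [0, 0, 2, 2, 2]})

# Approach used above: add a column of 1s and sum it within each governorate
joined["Number of Protests"] = 1
summed = joined.groupby("Governorate Number")["Number of Protests"].sum()

# Alternative: size() simply counts the rows in each group
sized = joined.groupby("Governorate Number").size()

print(summed)
print(sized)
```

In both results governorate 1 is absent, because it has no rows at all in the joined data.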

Now we can merge the aggregated protest data with our governorate data.

In [None]:
Iraq['Number of Protests'] = IraqProtests
Iraq
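This assignment works because pandas matches the two objects on their index labels, and here the row labels of the governorate table happen to equal the governorate numbers. The sketch below shows the matching on made-up data, and one possible way of handling a governorate with no protests, which would otherwise be left as a missing value.

```python
import pandas

# Made-up data for illustration: three governorates with row labels 0, 1, 2
regions = pandas.DataFrame({"Governorate Number": [0, 1, 2]})

# Made-up counts: governorate 1 has no protests, so it has no entry
counts = pandas.Series([2, 3], index=[0, 2], name="Number of Protests")

# Values are matched by index label, not by position
regions["Number of Protests"] = counts

# Governorate 1 is now missing (NaN); here we choose to treat that as zero protests
regions["Number of Protests"] = regions["Number of Protests"].fillna(0).astype(int)

print(regions)
```

Whether a missing count should be shown as zero or left blank on the map is a judgement call that depends on how complete the underlying data is.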

In [None]:
ax = Iraq.plot(column='Number of Protests', cmap = 'viridis', legend=True, figsize=(10,10),legend_kwds={'label': "Number of Protests Since 2016",'orientation': "horizontal", "pad": 0.01}, vmin=0, vmax=1000);

ax.axis('off')

We can add the locations of the protests as well, combining both datasets in one plot.

In [None]:
ax = Iraq.plot(column='Number of Protests', cmap = 'viridis', legend=True, figsize=(10,10),legend_kwds={'label': "Number of Protests Since 2016",'orientation': "horizontal", "pad": 0.01}, vmin=0, vmax=1000);

ConflictGeoDataFrame.plot(color='r', alpha=0.2, markersize = 5, ax=ax)

ax.axis('off')

We can also visualize this as a bar graph by governorate.

In [None]:
seaborn.barplot(x = "Governorate Number", y="Number of Protests", data = Iraq)

And, if we wanted to, we could look at the most frequent locations for protests:

In [None]:
FrequencyofProtest = ConflictGeoDataFrame['LOCATION'].value_counts()

FrequencyofProtest.head(5)

As a bar chart of the top 20 locations:

In [None]:
FrequencyofProtest = FrequencyofProtest.head(20)

FrequencyofProtest = pandas.DataFrame({'Location':FrequencyofProtest.index, 'Number of Protests': FrequencyofProtest.values})

fig, ax = plt.subplots(figsize = (30, 10))

seaborn.barplot(x = "Location", y="Number of Protests", data = FrequencyofProtest, ax=ax)