# Cleaning Dataset before plotting

In [3]:
import numpy as np
import pandas as pd

First, download and import the dataset using pandas. Reading an `.xlsx` file with `pd.read_excel` requires the `openpyxl` package to be installed.

In [4]:
df_can = pd.read_excel(
    'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2)

print('Data read into a pandas dataframe!')

Data read into a pandas dataframe!


In [None]:
df_can.head()

Before analyzing, start by getting basic information about the dataframe

In [None]:
df_can.info(verbose=False)

In [None]:
df_can.columns

In [None]:
df_can.index

Note: `df_can.columns` and `df_can.index` are pandas `Index` objects, not Python lists. To convert them to lists, use the `tolist()` method:

In [None]:
df_can.columns.tolist()
df_can.index.tolist()

In [None]:
print(type(df_can.columns.tolist()))
print(type(df_can.index.tolist()))
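The difference between an `Index` object and a list can be checked on a small example. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not part of the Canada dataset.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the types involved
toy = pd.DataFrame({'Country': ['A', 'B'], 1980: [10, 20]})

print(type(toy.columns))   # a pandas Index object, not a list
print(type(toy.index))     # a RangeIndex for the default row labels

cols = toy.columns.tolist()  # plain Python list of column labels
rows = toy.index.tolist()    # plain Python list of row labels
print(cols, rows)
```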

Let's clean the dataset by removing unnecessary columns.

In [None]:
# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)
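To see what `axis=1` does without touching `df_can`, here is a minimal sketch on a made-up frame (`toy` is hypothetical; only the column names mimic the dataset). It also shows the `columns=` keyword, which is an equivalent way to name the axis, and that `drop` returns a new frame unless `inplace=True` is passed.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up
toy = pd.DataFrame({'AREA': [1, 2], 'REG': [3, 4], 'OdName': ['A', 'B'], 1980: [10, 20]})

# axis=1 tells drop() that the labels refer to columns
dropped_axis = toy.drop(['AREA', 'REG'], axis=1)

# Equivalent spelling that names the axis explicitly
dropped_cols = toy.drop(columns=['AREA', 'REG'])

print(dropped_axis.columns.tolist())
print(toy.columns.tolist())  # the original is unchanged without inplace=True
```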

Let's also rename the columns to names that make more sense.

In [None]:
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns

We will add a `Total` column that sums the immigrants from each country over the entire period 1980-2013.

In [None]:
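# numeric_only=True skips the text columns (Country, Continent, ...) when summing each row
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)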
df_can
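A row-wise total on a frame that mixes text and numeric columns can be sketched on a made-up frame (`toy` is hypothetical). The sketch passes `numeric_only=True` so that only numeric columns are added; depending on the pandas version, omitting it may raise an error when text columns are present.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two year columns
toy = pd.DataFrame({'Country': ['A', 'B'], 1980: [10, 20], 1981: [1, 2]})

# Sum across columns (axis=1) for each row, ignoring non-numeric columns
toy['Total'] = toy.sum(axis=1, numeric_only=True)
print(toy)
```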

Check for any null values

In [None]:
df_can.isnull().sum()

Select a single column, here `Country`

In [None]:
df_can.Country

Select the list of countries and the data for years 1980-1985

In [None]:
df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]]

First, let's set the `Country` column as our index

In [None]:
df_can.set_index('Country', inplace=True)
df_can.head(3)
# tip: The opposite of set is reset. So to reset the index, we can use df_can.reset_index()
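The round trip between `set_index` and `reset_index` can be sketched on a made-up frame (`toy` is hypothetical). Both calls return new frames here because `inplace=True` is not used.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up
toy = pd.DataFrame({'Country': ['A', 'B'], 1980: [10, 20]})

indexed = toy.set_index('Country')   # 'Country' becomes the row labels
restored = indexed.reset_index()     # 'Country' moves back to a regular column

print(indexed.index.tolist())
print(restored.columns.tolist())
```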

There are two main ways to select rows:

```python
    df.loc[label]    # filters by the labels of the index/column
    df.iloc[index]   # filters by the positions (indexing) of the index/column
```

Looking for specific data

In [None]:
# 1. the full row data (all columns)
df_can.loc['Japan']

In [None]:
# alternate methods
df_can.iloc[87]

In [None]:
df_can[df_can.index == 'Japan']

In [None]:
# 2. for year 2013
df_can.loc['Japan', 2013]

In [None]:
# alternate method
# year 2013 is the last year column (only 'Total' follows it), with a positional index of 36
df_can.iloc[87, 36]

In [None]:
# 3. for years 1980 to 1985
df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1985]]

In [None]:
# Alternative Method
df_can.iloc[87, [3, 4, 5, 6, 7, 8]]
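Hard-coded positions such as `87` or `36` break as soon as rows or columns are added or removed. One way to avoid them, sketched below on a made-up frame (`toy` is hypothetical), is to derive the positions from the labels with `Index.get_loc` and then compare the `.loc` and `.iloc` results.

```python
import pandas as pd

# Hypothetical toy frame indexed by country; the values are made up
toy = pd.DataFrame(
    {'Continent': ['Asia', 'Europe'], 1980: [10, 20], 1981: [11, 21]},
    index=['Japan', 'France'],
)

# Label-based lookup
by_label = toy.loc['Japan', [1980, 1981]]

# Position-based lookup, with positions derived from the labels
row_pos = toy.index.get_loc('Japan')
col_pos = [toy.columns.get_loc(year) for year in (1980, 1981)]
by_position = toy.iloc[row_pos, col_pos]

print(by_label.tolist() == by_position.tolist())
```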

Let's convert the year column labels (integers) to strings. The code below converts every column label to a string.
- The `map()` function applies a given function to each item of an iterable, passing the item to the function as its argument.

In [None]:
df_can.columns = list(map(str, df_can.columns))
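To see `map()` on its own, here is a minimal sketch with a short made-up list of labels (`labels` is hypothetical). The list comprehension at the end is an equivalent spelling.

```python
# map() applies str to each item; list() materializes the lazy result
labels = [1980, 1981, 1982]
as_strings = list(map(str, labels))
print(as_strings)

# An equivalent list comprehension
also_strings = [str(label) for label in labels]
print(as_strings == also_strings)
```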

Declare a variable that lets us refer to the full range of years

In [6]:
years = list(map(str, range(1980, 2014)))
years

['1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013']

We can filter data based on certain criteria by giving a condition

This will only give data where Continent is Asia

In [None]:
condition = df_can['Continent'] == 'Asia'
df_can[condition]

We can also pass multiple criteria in the same line

In [None]:
df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]

# note: to combine conditions, pandas requires the element-wise operators '&' and '|' instead of 'and' and 'or',
# and each condition must be wrapped in its own parentheses
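Combining conditions can be tried on a made-up frame (`toy` and its rows are hypothetical). The sketch shows both `&` and `|`, with each comparison in its own parentheses.

```python
import pandas as pd

# Hypothetical toy frame indexed by country; the values are made up
toy = pd.DataFrame(
    {'Continent': ['Asia', 'Asia', 'Europe'],
     'Region': ['Southern Asia', 'Eastern Asia', 'Western Europe']},
    index=['India', 'Japan', 'France'],
)

# Each comparison needs its own parentheses because & and | bind tighter than ==
both = toy[(toy['Continent'] == 'Asia') & (toy['Region'] == 'Southern Asia')]
either = toy[(toy['Continent'] == 'Europe') | (toy['Region'] == 'Eastern Asia')]

print(both.index.tolist())
print(either.index.tolist())
```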

# Visualizing using Matplotlib

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

Extract the data series for Haiti

In [None]:
years

In [8]:
haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'Total' column
haiti.head(5)

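KeyError: 'Haiti'

Note: this `KeyError` appears when `Country` is not the index of `df_can`, for example when the `set_index` cell above has not been run. The column labels must also have been converted to strings for `years` to match.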

Now we draw a line plot

In [None]:
haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.text(2000, 6000, '2010 Earthquake') # Text annotation 2000 being x-axis and 6000 being y-axis.

plt.show() # need this line to show the updates made to the figure

pandas automatically populates the x-axis with the index values (`years`) and the y-axis with the column values (`population`).

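Note: The `transpose()` method swaps the rows and columns of a DataFrame, which is useful when the years are stored as columns and need to become the index for plotting. On a Series such as `haiti`, `haiti.transpose()` returns the Series unchanged.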
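The effect of `transpose()` on a DataFrame can be sketched on a made-up frame (`toy` is hypothetical): countries start as rows and years as columns, and the transposed frame has years as its index.

```python
import pandas as pd

# Hypothetical toy frame: countries as rows, years as columns
toy = pd.DataFrame({1980: [10, 20], 1981: [11, 21]}, index=['Haiti', 'Japan'])

flipped = toy.transpose()  # years become the row index, countries the columns
print(flipped)
```

With years on the index, a call such as `flipped.plot(kind='line')` would draw one line per country.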