# Assignment 1
**Emma McCready**

---

### Analysis of population over time

You are required to collect, process, analyse and interpret the data in order to identify possible issues/problems at present and make predictions/classifications in regard to the future. This analysis will rely on the available data from CSO and any additional data you deem necessary (with supporting evidence) to support your hypothesis for this scenario.

Areas of focus:
* Annual Population Change
* Population Forecasting


# Part one: Data prep

In [2]:
# Load in packages
import pandas as pd
import numpy as np
from scipy import stats


In [3]:
# Load in data

pop_data = pd.read_csv("migration_data.csv")

#printing the first 5 rows
pop_data.head()

| | STATISTIC Label | Year | Country | Sex | Origin or Destination | UNIT | VALUE |
|---|---|---|---|---|---|---|---|
| 0 | Estimated Migration (Persons in April) | 1987 | United Kingdom (1) | Both sexes | Net migration | Thousand | -13.7 |
| 1 | Estimated Migration (Persons in April) | 1987 | United Kingdom (1) | Both sexes | Emigrants: All destinations | Thousand | 21.8 |
| 2 | Estimated Migration (Persons in April) | 1987 | United Kingdom (1) | Both sexes | Immigrants: All origins | Thousand | 8.1 |
| 3 | Estimated Migration (Persons in April) | 1987 | United Kingdom (1) | Male | Net migration | Thousand | -9.0 |
| 4 | Estimated Migration (Persons in April) | 1987 | United Kingdom (1) | Male | Emigrants: All destinations | Thousand | 13.1 |


### Getting a sense of the data...

In [4]:
# Getting the length of the dataframe along with the number of columns, in the output (number of rows, number of cols)
pop_data.shape

(2664, 7)

In [5]:
# getting data types..

pop_data.dtypes

STATISTIC Label           object
Year                       int64
Country                   object
Sex                       object
Origin or Destination     object
UNIT                      object
VALUE                    float64
dtype: object

In [6]:
# getting a sense of the distribution and variables..

pop_data.describe()

| | Year | VALUE |
|---|---|---|
| count | 2664.0 | 2104.0 |
| mean | 2005.0 | 8.943726 |
| std | 10.679083 | 15.513703 |
| min | 1987.0 | -43.9 |
| 25% | 1996.0 | 1.8 |
| 50% | 2005.0 | 4.7 |
| 75% | 2014.0 | 10.2 |
| max | 2023.0 | 151.1 |


From the above, "Year" and "VALUE" are the only numerical columns. Year is effectively a categorical variable here, so the only information I'm interested in is that the min is 1987 and the max is 2023. This means there is migration data for the period 1987 to 2023.  
Regarding the VALUE column, I note from `pop_data.head()` that this variable holds the net migration as well as the number of incoming and outgoing people, all as individual observations. So the summary statistics above (i.e. the mean, std, etc.) mix three different measures and aren't meaningful as they stand.  
Also worth noting is the difference between the counts for "Year" and "VALUE": 2664 versus 2104, implying 560 missing values in VALUE. I'd like to confirm how many are empty, and may as well check the whole dataset for missing data:

In [7]:
# find whether columns contain null values
print(pop_data.isnull().sum())

STATISTIC Label            0
Year                       0
Country                    0
Sex                        0
Origin or Destination      0
UNIT                       0
VALUE                    560
dtype: int64
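
Knowing that 560 values are missing doesn't say where they sit. A minimal sketch of how the gaps could be broken down by country and by year is below. It runs on a small made-up frame (`toy` and its values are assumptions, only the column names come from the dataset), so the numbers mean nothing for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the real column names; the values are made up
toy = pd.DataFrame({
    "Year": [1987, 1987, 1988, 1988, 1989, 1989],
    "Country": ["Canada", "Australia", "Canada", "Australia", "Canada", "Australia"],
    "VALUE": [np.nan, 1.2, np.nan, 0.8, 2.1, np.nan],
})

# True where VALUE is missing
is_missing = toy["VALUE"].isnull()

# Number of missing VALUE entries per country
missing_by_country = is_missing.groupby(toy["Country"]).sum()

# Share of missing VALUE entries per year
missing_share_by_year = is_missing.groupby(toy["Year"]).mean()

print(missing_by_country)
print(missing_share_by_year)
```

The same two `groupby` lines should work on `pop_data` by swapping `toy` for `pop_data`.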


In [8]:
# Summarising the categorical variables:
pop_data.describe(include=object)

| | STATISTIC Label | Country | Sex | Origin or Destination | UNIT |
|---|---|---|---|---|---|
| count | 2664 | 2664 | 2664 | 2664 | 2664 |
| unique | 1 | 8 | 3 | 3 | 1 |
| top | Estimated Migration (Persons in April) | United Kingdom (1) | Both sexes | Net migration | Thousand |
| freq | 2664 | 333 | 888 | 888 | 2664 |


Not much valuable info above, I think. But I'm curious about the "top" values (the modes) for each of these. I assume they are only listed as the mode because they are the first value in the dataset, but just for peace of mind:

In [9]:
print('Value counts for "Country":\n', pop_data['Country'].value_counts(), '\n\nValue counts for "sex":\n', pop_data['Sex'].value_counts())



Value counts for "Country":
 United Kingdom (1)                                     333
United States                                          333
Canada                                                 333
Australia                                              333
Other countries (23)                                   333
All countries                                          333
EU14 excl Irl (UK & Ireland)                           333
EU15 to EU27 (accession countries joined post 2004)    333
Name: Country, dtype: int64 

Value counts for "sex":
 Both sexes    888
Male          888
Female        888
Name: Sex, dtype: int64
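
Every category appears equally often, so the "top" entry is just one of several tied values. The pandas documentation for `describe` says the choice among ties is arbitrary. As a rough sketch, the tie can be checked in code rather than by eye. The helper `is_balanced` and the frame `toy` are hypothetical names, and the data is made up.

```python
import pandas as pd

# Hypothetical toy frame; only the column names come from the dataset
toy = pd.DataFrame({
    "Country": ["Canada", "Australia", "Canada", "Australia"],
    "Sex": ["Male", "Male", "Female", "Female"],
})


def is_balanced(series):
    """Return True when every category appears the same number of times."""
    return bool(series.value_counts().nunique() == 1)


# One True/False flag per categorical column
balanced = {col: is_balanced(toy[col]) for col in ["Country", "Sex"]}
print(balanced)
```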


In [10]:
# Obtaining some further information on the dataset:
pop_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2664 entries, 0 to 2663
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   STATISTIC Label        2664 non-null   object 
 1   Year                   2664 non-null   int64  
 2   Country                2664 non-null   object 
 3   Sex                    2664 non-null   object 
 4   Origin or Destination  2664 non-null   object 
 5   UNIT                   2664 non-null   object 
 6   VALUE                  2104 non-null   float64
dtypes: float64(1), int64(1), object(5)
memory usage: 145.8+ KB


Initial thoughts: This dataset gives the estimated number of emigrants and immigrants (in thousands) in any given year, to and from a given country or group of countries.

The first column is redundant, since it only holds one value. It's also inconvenient to have the net migration, the number of immigrants and the number of emigrants all in the same column, but I should be able to find a workaround without adding unnecessary columns.
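
One possible workaround, sketched here as an assumption rather than the approach taken in the rest of this notebook, is to reshape the data into a separate wide view with one column per migration type. The sketch uses only the three "Both sexes" rows shown by `pop_data.head()`. It also checks whether net migration equals immigrants minus emigrants for those rows. The names `rows`, `wide`, `implied_net` and `gap` are made up for the example.

```python
import pandas as pd

# The three "Both sexes" rows printed by pop_data.head() earlier
rows = pd.DataFrame({
    "Year": [1987, 1987, 1987],
    "Country": ["United Kingdom (1)"] * 3,
    "Sex": ["Both sexes"] * 3,
    "Origin or Destination": [
        "Net migration",
        "Emigrants: All destinations",
        "Immigrants: All origins",
    ],
    "VALUE": [-13.7, 21.8, 8.1],
})

# Long -> wide: one row per (Year, Country, Sex), one column per migration type
wide = (
    rows.set_index(["Year", "Country", "Sex", "Origin or Destination"])["VALUE"]
    .unstack("Origin or Destination")
)

# Sanity check: does net migration match immigrants minus emigrants?
implied_net = wide["Immigrants: All origins"] - wide["Emigrants: All destinations"]
gap = (implied_net - wide["Net migration"]).abs()

print(wide)
print(gap.max())
```

The original long-format frame is left untouched, so later cells that filter on "Origin or Destination" would still work.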

### Tidying it up a bit to make it easier to work with:
Ideas for tidying:
* make some things lower case?
* rename cols, e.g. instead of "Origin or Destination" change it to migration_type
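
A minimal sketch of the renaming idea on a one-row made-up frame. Only `migration_type` comes from the list above. The other new names are hypothetical. `rename` returns a new frame, so the original column names stay available.

```python
import pandas as pd

# Hypothetical one-row frame with three of the dataset's columns
toy = pd.DataFrame({
    "STATISTIC Label": ["Estimated Migration (Persons in April)"],
    "Origin or Destination": ["Net migration"],
    "VALUE": [-13.7],
})

# rename returns a new frame; toy keeps its original column names
tidy = toy.rename(columns={
    "STATISTIC Label": "statistic_label",       # hypothetical new name
    "Origin or Destination": "migration_type",  # name suggested in the list above
    "VALUE": "value",                           # hypothetical new name
})

print(list(tidy.columns))
```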

In [13]:
# Making it lowercase
## Actually, no, don't do this
#pop_data.columns = pop_data.columns.str.lower()

# Renaming columns
# First making sure there are only 3 types of "Origin or Destination"
print(pop_data['Origin or Destination'].value_counts())          # output confirms this

# Rename col (not done yet)


Net migration                  888
Emigrants: All destinations    888
Immigrants: All origins        888
Name: Origin or Destination, dtype: int64


In [None]:
## Goal: Extract all instances of net migration and move them into a new dataframe. Make sure to bring all other cols with them.

In [46]:
# First attempt with .loc (doesn't work: .loc selects rows by index label, not by cell value)
#net_migration = pop_data.loc[["Net migration"], ["Origin or Destination"]]

## Using boolean masks instead...
is_net_migration = (pop_data["Origin or Destination"] == "Net migration")  # True for each row where the data is 'Net migration'
is_both_sexes = (pop_data["Sex"] == "Both sexes")  # same as above, but for 'Both sexes'
net_migration = pop_data[is_net_migration & is_both_sexes]  # new dataframe with just the net migration rows for both sexes

#net_migration.head()

# Creating the opposite dataset: immigrant/emigrant rows for the individual sexes
# (the masks keep their own names so that ~ is applied to a boolean Series, not to a dataframe)
migration_by_sex = pop_data[~is_net_migration & ~is_both_sexes]
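
To make the mask logic concrete, here is a small sketch on a made-up frame (`toy` and its values are assumptions). It shows the difference between negating each mask separately and negating the combined mask, which is easy to mix up when building the "opposite" dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Female", "Both sexes", "Male", "Female"],
    "Origin or Destination": ["Net migration"] * 3 + ["Immigrants: All origins"] * 3,
    "VALUE": [-13.7, -9.0, -4.7, 8.1, 4.0, 4.1],
})

# Boolean masks: one True/False per row
is_net = toy["Origin or Destination"] == "Net migration"
is_both = toy["Sex"] == "Both sexes"

# Rows that are net migration AND both sexes
net_both = toy[is_net & is_both]

# Rows that are NOT net migration AND NOT both sexes
flows_by_sex = toy[~is_net & ~is_both]

# Everything except the net_both rows (a larger set than flows_by_sex)
everything_else = toy[~(is_net & is_both)]

print(len(net_both), len(flows_by_sex), len(everything_else))
```

Keeping the masks under their own names (rather than overwriting one with the filtered dataframe) means `~` is always applied to a boolean Series.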

Now the summary statistics for net migration will be much more meaningful:

In [42]:
net_migration["VALUE"].describe()

count    236.000000
mean       5.311017
std       17.635191
min      -43.900000
25%       -1.400000
50%        1.550000
75%        7.250000
max      104.800000
Name: VALUE, dtype: float64