### Explore Data

In [None]:
import pandas as pd

## Immigration data

In [None]:
fname = 'data/immigration_data_sample.csv'
df_immigrant = pd.read_csv(fname)
df_immigrant.head()

In [None]:
df_immigrant.columns

### Immigrant data column descriptions (for the value lists see I94_SAS_Labels_Descriptions.SAS)


| column name | description | type | is used by CIC | personal observations for ETL |
| --- | --- | --- | --- | --- |
| cicid | unique number for the immigrants | int | true | might be primary key |
| i94yr | 4 digit year | float | true | use int |
| i94mon | numeric month | float | true | use int |
| i94cit | 3 digit code of the country of origin | float | true | use int, if empty remove line, quality check for valid code according to list in .sas file |
| i94res | 3 digit code of the country from which the person has travelled | float | true | use int, if empty remove line, quality check for valid code according to list in .sas file |
| i94port | 3 character code of destination USA city | varchar | true | keep as varchar (3 character code), if empty remove line, quality check for valid code according to list in .sas file |
| arrdate | date of arrival in the U.S.A. | SAS date numeric | true | SAS date numeric field to which no permanent format has been applied; apply whichever date format works |
| i94mode | travel code (transportation) | 1 digit | true | 1 = 'Air' 2 = 'Sea' 3 = 'Land' 9 = 'Not reported' |
| i94addr | ??? | two digit | true | there are lots of invalid codes in this variable; the list in the .sas file shows the codes found to be valid, everything else goes into 'other'; not sure if we should use it |
| depdate | departure date from the USA | SAS date numeric | true | apply whichever date format works |
| i94bir | age of respondent in years | float | true | use int |
| i94visa | visa codes collapsed into three categories | 1 char | true | 1 = Business 2 = Pleasure 3 = Student |
| count | used for summary statistics | int | true | not sure if needed |
| dtadfile | character date field - date added to I-94 files | | false | |
| visapost | Department of State where visa was issued | | false | |
| occup | occupation that will be performed in U.S. | | false | |
| entdepa | arrival flag - admitted or paroled into the U.S. | | false | |
| entdepd | departure flag - departed, lost I-94 or is deceased | | false | |
| entdepu | update flag - either apprehended, overstayed, adjusted to perm residence | | false | |
| matflag | match flag - match of arrival and departure records | 1 char | true | not sure if needed |
| biryear | year of birth | 4 digit | true | |
| dtaddto | character date field - date to which admitted to U.S. (allowed to stay until) | | false | |
| gender | non-immigrant sex | 1 char | true | |
| insnum | INS number | number | true | check how many NaN, and whether to skip whole column |
| airline | airline used to arrive in U.S. | varchar | true | check how many NaN, and whether to skip whole column |
| admnum | admission number | | true | find out what this is |
| fltno | flight number of airline used to arrive in U.S. | varchar | true | |
| visatype | class of admission legally admitting the non-immigrant to temporarily stay in U.S. | | | |
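
The type notes in the table can be sketched in pandas. SAS stores dates as the number of days since 1960-01-01, so `arrdate` and `depdate` can be converted with `pd.to_datetime`. The helper names `sas_to_date` and `to_nullable_int` and the sample values are made up for illustration; this is a minimal sketch, not the final ETL step.

```python
import pandas as pd

# SAS date numerics count days since 1960-01-01.
SAS_EPOCH = pd.Timestamp("1960-01-01")


def sas_to_date(values: pd.Series) -> pd.Series:
    """Convert SAS date numerics to pandas timestamps (NaN becomes NaT)."""
    return pd.to_datetime(values, unit="D", origin=SAS_EPOCH)


def to_nullable_int(values: pd.Series) -> pd.Series:
    """Cast whole-number floats to the nullable Int64 dtype (NaN becomes <NA>)."""
    return values.astype("Int64")


# Hypothetical sample values; NaN stands for a missing entry.
dates = sas_to_date(pd.Series([0.0, 366.0, float("nan")]))
years = to_nullable_int(pd.Series([2016.0, float("nan")]))
print(dates)
print(years)
```

The nullable `Int64` dtype is used here because a plain `int` cast fails as soon as a column contains NaN.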


`false` in the "is used by CIC" column means that CIC has not used the column for analysis.
This can be interpreted in several ways. One likely interpretation is that the data in these columns is not as clean as in the other columns.
We will therefore not take these columns over.
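
Dropping the columns that CIC does not use could look roughly like this. The list is copied from the rows marked `false` in the table; the helper name `drop_unused_columns` and the sample frame are hypothetical.

```python
import pandas as pd

# Columns marked "false" in the "is used by CIC" column of the table.
UNUSED_BY_CIC = [
    "dtadfile", "visapost", "occup", "entdepa",
    "entdepd", "entdepu", "dtaddto",
]


def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the unused columns; columns that are absent are skipped."""
    return df.drop(columns=UNUSED_BY_CIC, errors="ignore")


# Hypothetical sample frame for illustration.
sample = pd.DataFrame(
    {"cicid": [1, 2], "occup": [None, "STU"], "i94yr": [2016.0, 2016.0]}
)
trimmed = drop_unused_columns(sample)
print(trimmed.columns.tolist())
```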

In [None]:
print('number of rows:', len(df_immigrant))
df_immigrant.describe()

In [None]:
df_immigrant.groupby(['biryear'])[['i94yr']].count()
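
The table flags `insnum` and `airline` for a NaN check. One way to do that check is to compute the share of missing values per column, as in the sketch below. The helper name `nan_share`, the 0.5 threshold, and the sample values are assumptions for illustration.

```python
import pandas as pd


def nan_share(df: pd.DataFrame, columns: list) -> pd.Series:
    """Fraction of missing values per column (0.0 = none missing, 1.0 = all missing)."""
    return df[columns].isna().mean()


# Hypothetical sample frame for illustration.
sample = pd.DataFrame(
    {
        "insnum": [None, None, None, 3668.0],
        "airline": ["LH", None, "UA", "DL"],
    }
)
shares = nan_share(sample, ["insnum", "airline"])

# Assumed threshold: skip a column if more than half of its values are missing.
MAX_NAN_SHARE = 0.5
to_skip = shares[shares > MAX_NAN_SHARE].index.tolist()
print(shares)
print(to_skip)
```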

## Demographics

In [None]:
fname = 'data/us-cities-demographics.csv'
df_demographics = pd.read_csv(fname, delimiter=';')
df_demographics.head()

In [None]:
print(df_demographics.columns)
df_demographics.count()

### Demographics data column descriptions


| column name| description | type |personal observations for ETL | 
| --- | --- | --- | --- |
| City | name of the city| varchar | | 
| State | name of the state| varchar | | 
| Median Age| median age|float||
| Male population|number of male population|float| use int|
| Female population|number of female population| float| use int|
|Total population |total population| int| maybe qa check for sum of male and female|
|Number of Veterans|number of veterans| float| can be int, but not needed for immigration data analytics|
|Foreign-born| foreign born| float| use int|
|Average household size |average household size| float||
|State code | state code| varchar| american states, maybe important for dim_location |
|Race| race| string ||
|Count|???| int| maybe some count from other analysis, can be skipped |
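
The QA check noted for the total population can be sketched as follows. The default column names follow the table and should be checked against `df_demographics.columns`, since the capitalization in the CSV header may differ. The helper name `population_mismatches` and the sample rows are hypothetical.

```python
import pandas as pd


def population_mismatches(
    df: pd.DataFrame,
    male: str = "Male population",
    female: str = "Female population",
    total: str = "Total population",
) -> pd.DataFrame:
    """Return rows where male + female differs from the reported total.

    Rows with a missing male or female value are returned as well,
    because NaN never compares equal to the total.
    """
    summed = df[male] + df[female]
    return df[summed != df[total]]


# Hypothetical sample rows for illustration.
sample = pd.DataFrame(
    {
        "Male population": [10.0, 5.0, None],
        "Female population": [12.0, 5.0, 7.0],
        "Total population": [22, 11, 7],
    }
)
bad = population_mismatches(sample)
print(bad)
```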


## Airport codes

In [None]:
fname = 'data/airport-codes_csv.csv'
df_airport_codes = pd.read_csv(fname)
print(df_airport_codes.head(20))

print('total length:', len(df_airport_codes))


### Airport data column descriptions

| column name| description | type |personal observations for ETL | 
| --- | --- | --- | --- |
| ident | unique identifier of airport | varchar |  |
| type | type of airport  | varchar | could be enum:   heliport, small_airport, medium_airport, closed |
| name | name of airport | varchar |  |
| elevation_ft| elevation in feet | float |  |
| continent | continent | varchar | a lot of NaN, maybe skip this column  |
| iso_country | country iso |varchar  |  2 chars|
| iso_region | region iso | varchar |  pattern XX-XX|
| municipality |  municipality | string |  |
| gps_code | gps code |  short varchar | check for NaN  |
| iata_code | iata code |  | as there are a lot of NaN, we will skip this |
| local_code |  |  |  also a lot of NaN, see if should be skipped|
| coordinates | lon and lat  | float, float | lon and lat as tuple |
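
Splitting `coordinates` into two numeric columns could look like the sketch below. It assumes the column holds strings of the form `"<lon>, <lat>"` with longitude first, as the table suggests; this should be verified against the data. The helper name `split_coordinates` and the sample values are made up.

```python
import pandas as pd


def split_coordinates(coords: pd.Series) -> pd.DataFrame:
    """Split 'lon, lat' strings into numeric longitude and latitude columns."""
    parts = coords.str.split(",", expand=True)
    return pd.DataFrame(
        {
            # Assumption: longitude comes first, latitude second.
            "longitude": pd.to_numeric(parts[0].str.strip()),
            "latitude": pd.to_numeric(parts[1].str.strip()),
        }
    )


# Hypothetical sample values for illustration.
sample = pd.Series(["-74.9336, 40.0708", "-101.4739, 38.7040"])
lon_lat = split_coordinates(sample)
print(lon_lat)
```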


## Weather

In [None]:
fname = 'data/GlobalLandTemperaturesByCity.csv'
df_temperature = pd.read_csv(fname)

# keep only the U.S. rows, as the immigration data covers only the United States

df_temperature_us = df_temperature[df_temperature["Country"] == "United States"]
print(df_temperature_us)
print(len(df_temperature_us))

In [None]:
print('length:', len(df_temperature))
df_temperature.count()

### Weather data column descriptions

| column name| description | type |personal observations for ETL | 
| --- | --- | --- | --- |
|dt | date | string YYYY-MM-DD |  |
|AverageTemperature | avg temp | float| check NAN |
|AverageTemperatureUncertainty| temp uncertainty |float|check NAN, not important for now|
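|City| name of the city | varchar | |
|Country| name of the country | varchar | |
|Latitude| latitude| string, e.g. "54.06N"| strip the hemisphere letter (N/S) and convert to float, parse to tuple?|
|Longitude| longitude| string, e.g. "54.06E"| strip the hemisphere letter (E/W) and convert to float, parse to tuple?|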
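
Parsing the latitude and longitude strings could be sketched as follows. The sketch assumes every value is a number followed by a single hemisphere letter and maps S and W to negative values; the helper name `parse_coordinate` and the sample values are hypothetical.

```python
import pandas as pd


def parse_coordinate(value: str) -> float:
    """Turn strings such as '54.06N' or '10.45W' into a signed float.

    Assumption: S and W are represented as negative numbers.
    """
    hemisphere = value[-1].upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unexpected hemisphere letter in {value!r}")
    magnitude = float(value[:-1])
    return -magnitude if hemisphere in {"S", "W"} else magnitude


# Hypothetical sample values for illustration.
latitudes = pd.Series(["54.06N", "32.95S"]).map(parse_coordinate)
longitudes = pd.Series(["54.06E", "10.45W"]).map(parse_coordinate)
print(list(zip(latitudes, longitudes)))
```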