### Explore Data

In [2]:
import pandas as pd

## Immigration data

In [5]:
fname = 'immigration_data_sample.csv'
df_immigrant = pd.read_csv(fname)
df_immigrant.head()

,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [7]:
df_immigrant.columns

Index(['Unnamed: 0', 'cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port',
       'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa',
       'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')

### Immigration data column descriptions (for the code lists, see I94_SAS_Labels_Descriptions.SAS)


| column name| description | type | is used by CIC | personal observations for ETL | 
| --- | --- | --- | --- | --- | 
| cicid | unique number for the immigrant record | float | true | might be primary key, use int |
| i94yr | 4 digit year | float | true | use int |
| i94mon | numeric month | float | true | use int |
| i94cit | 3 digit code of the country of origin (citizenship) | float | true | use int, if empty remove line, quality check for valid code according to list in .sas file |
| i94res | 3 digit code of the country from which one has travelled (residence) | float | true | use int, if empty remove line, quality check for valid code according to list in .sas file |
| i94port | 3 character code of the destination city (port of entry) in the U.S.A. | varchar | true | keep as varchar, if empty remove line, quality check for valid code according to list in .sas file |
| arrdate | date of arrival in the U.S.A. | SAS date numeric | true | A SAS numeric date field to which no permanent format has been applied; apply whichever date format works for the target schema. |
| i94mode | travel code (mode of transportation) | 1 digit | true | 1 = 'Air', 2 = 'Sea', 3 = 'Land', 9 = 'Not reported' |
| i94addr | unclear; the sample values (HI, TX, FL, CA, NY) look like U.S. state codes | 2 chars | true | The .sas file notes that this variable contains many invalid codes and lists the ones found to be valid; everything else goes into 'other'. Not sure if we should use it. |
| depdate | departure date from the U.S.A. | SAS date numeric | true | apply the same date format as for arrdate |
| i94bir | age of respondent in years | float | true | use int |
| i94visa | visa codes collapsed into three categories | 1 digit | true | 1 = Business, 2 = Pleasure, 3 = Student |
| count | used for summary statistics | int | true | not sure if needed |
| dtadfile | character date field: date added to I-94 files | | false | |
| visapost | Department of State post where the visa was issued | | false | |
| occup | occupation that will be performed in the U.S. | | false | |
| entdepa | arrival flag: admitted or paroled into the U.S. | | false | |
| entdepd | departure flag: departed, lost I-94 or is deceased | | false | |
| entdepu | update flag: either apprehended, overstayed, adjusted to permanent residence | | false | |
| matflag | match flag: match of arrival and departure records | 1 char | true | not sure if needed |
| biryear | year of birth | 4 digit | true | |
| dtaddto | character date field: date to which admitted to the U.S. (allowed to stay until) | | false | |
| gender | non-immigrant sex | 1 char | true | |
| insnum | INS number | number | true | check how many NaN, and whether to skip the whole column |
| airline | airline used to arrive in the U.S. | varchar | true | check how many NaN, and whether to skip the whole column |
| admnum | admission number | | true | find out what this is |
| fltno | flight number of the airline used to arrive in the U.S. | varchar | true | |
| visatype | class of admission legally admitting the non-immigrant to temporarily stay in the U.S. | | | |
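
The `arrdate` and `depdate` columns hold SAS numeric dates, which count days since 1960-01-01. The following is a minimal sketch of one way to convert them with pandas; the helper name `sas_to_date` is only illustrative and not part of the original notebook.

```python
import pandas as pd

# SAS numeric dates count days from this epoch.
SAS_EPOCH = pd.Timestamp("1960-01-01")


def sas_to_date(sas_days: pd.Series) -> pd.Series:
    """Convert SAS day counts to datetimes; missing values become NaT."""
    return SAS_EPOCH + pd.to_timedelta(sas_days, unit="D")


# The first two values are arrdate entries from the sample rows above;
# the None stands in for a missing depdate.
arrdate = pd.Series([20566.0, 20550.0, None])
print(sas_to_date(arrdate))
```

For the first sample row this gives 2016-04-22, which agrees with `i94yr` = 2016 and `i94mon` = 4.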


"CIC does not use" means that the column has not been used by CIC for analysis. 
This can be interpreted in several ways. One likely reading is that the data in these columns are not as clean as the data in the other columns.
We will therefore not carry these columns over.
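
The observations in the table can be turned into a small cleaning step. The sketch below drops the columns marked false under "is used by CIC", removes rows with an empty code, and casts the float columns to integers. The function name `clean_immigration` and the third sample row are made up for illustration; the nullable `Int64` dtype is one possible choice, not something the notebook prescribes.

```python
import pandas as pd

# Columns marked "false" under "is used by CIC" in the table above.
UNUSED_BY_CIC = ["dtadfile", "visapost", "occup", "entdepa", "entdepd", "entdepu", "dtaddto"]
# Float columns for which the table suggests "use int".
INT_COLUMNS = ["i94yr", "i94mon", "i94cit", "i94res", "i94bir"]
# Code columns for which the table suggests "if empty remove line".
REQUIRED_CODES = ["i94cit", "i94res", "i94port"]


def clean_immigration(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, remove rows with empty codes, cast floats to integers."""
    out = df.drop(columns=UNUSED_BY_CIC, errors="ignore")
    out = out.dropna(subset=[c for c in REQUIRED_CODES if c in out.columns])
    for col in INT_COLUMNS:
        if col in out.columns:
            # Nullable integer dtype: keeps any remaining missing values as <NA>.
            out[col] = out[col].astype("Int64")
    return out


# Two rows taken from the sample above plus one made-up row with a missing i94cit.
sample = pd.DataFrame({
    "cicid": [4084316.0, 4422636.0, 1.0],
    "i94yr": [2016.0, 2016.0, 2016.0],
    "i94cit": [209.0, 582.0, None],
    "i94res": [209.0, 582.0, 100.0],
    "i94port": ["HHW", "MCA", "XXX"],
    "occup": [None, None, None],
})
cleaned = clean_immigration(sample)
print(cleaned.dtypes)
```

The quality check against the code lists in the .sas file is left out here, because it depends on how those lists are parsed.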

In [39]:
print('number of rows:', len(df_immigrant))
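# Note: without parentheses this displays the bound method (and with it the whole frame),
# not summary statistics; call df_immigrant.describe() to get the statistics.
df_immigrant.describe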

number of rows: 1000


<bound method NDFrame.describe of      Unnamed: 0      cicid   i94yr  i94mon  i94cit  i94res i94port  arrdate  \
0       2027561  4084316.0  2016.0     4.0   209.0   209.0     HHW  20566.0   
1       2171295  4422636.0  2016.0     4.0   582.0   582.0     MCA  20567.0   
2        589494  1195600.0  2016.0     4.0   148.0   112.0     OGG  20551.0   
3       2631158  5291768.0  2016.0     4.0   297.0   297.0     LOS  20572.0   
4       3032257   985523.0  2016.0     4.0   111.0   111.0     CHM  20550.0   
..          ...        ...     ...     ...     ...     ...     ...      ...   
995     2117909  4288772.0  2016.0     4.0   135.0   135.0     LVG  20567.0   
996     1463022  2947585.0  2016.0     4.0   261.0   261.0     PSP  20560.0   
997     1414569  2883298.0  2016.0     4.0   111.0   111.0     MIA  20560.0   
998     1094181  2264857.0  2016.0     4.0   582.0   582.0     ATL  20556.0   
999     2271807  4654865.0  2016.0     4.0   687.0   687.0     MIA  20568.0   

     i94mode i94a
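
Referencing `describe` without parentheses, as in the cell above, shows the bound method instead of summary statistics. The sketch below calls `describe()` properly and adds a helper for the "check how many NaN" notes on `insnum` and `airline`. The helper name `missing_share` and the sample values are illustrative only.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Small made-up frame; with the real data, pass df_immigrant instead.
sample = pd.DataFrame({
    "insnum": [None, None, None, "3668"],
    "airline": ["JL", "*GA", "LH", None],
    "i94bir": [61.0, 26.0, 76.0, 25.0],
})

print(sample.describe())      # summary statistics for the numeric columns
print(missing_share(sample))  # share of NaN per column
```

A column whose share is close to 1.0 is a candidate for being skipped entirely.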

In [40]:
df_immigrant.groupby(['biryear'])[['i94yr']].count()

biryear,i94yr
1923.0,1
1928.0,1
1929.0,1
1931.0,1
1932.0,1
...,...
2011.0,3
2012.0,10
2013.0,4
2014.0,2


## Demographics

In [12]:
fname = 'us-cities-demographics.csv'
df_demographics = pd.read_csv(fname, delimiter=';')
df_demographics.head()

,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [43]:
print(df_demographics.columns)
df_demographics.count()

Index(['City', 'State', 'Median Age', 'Male Population', 'Female Population',
       'Total Population', 'Number of Veterans', 'Foreign-born',
       'Average Household Size', 'State Code', 'Race', 'Count'],
      dtype='object')


City                      2891
State                     2891
Median Age                2891
Male Population           2888
Female Population         2888
Total Population          2891
Number of Veterans        2878
Foreign-born              2878
Average Household Size    2875
State Code                2891
Race                      2891
Count                     2891
dtype: int64

### Demographics data column descriptions


| column name | description | type | personal observations for ETL |
| --- | --- | --- | --- |
| City | name of the city | varchar | |
| State | name of the state | varchar | |
| Median Age | median age | float | |
| Male Population | number of male inhabitants | float | use int |
| Female Population | number of female inhabitants | float | use int |
| Total Population | total population | int | maybe QA check against the sum of male and female |
| Number of Veterans | number of veterans | float | can be int, but not needed for immigration data analytics |
| Foreign-born | number of foreign-born inhabitants | float | use int |
| Average Household Size | average household size | float | |
| State Code | state code | varchar | American states, maybe important for dim_location |
| Race | race | varchar | |
| Count | unclear; judging from the sample rows, possibly the number of inhabitants of the given race in the city | int | maybe some count from another analysis, can be skipped |
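
The "use int" notes and the suggested QA check can be sketched as follows. The population columns contain a few missing values (2888 of 2891 rows are filled), so the sketch uses the nullable `Int64` dtype rather than plain `int`. The function names and the fourth sample row are made up for illustration.

```python
import pandas as pd

COUNT_COLUMNS = [
    "Male Population", "Female Population", "Total Population",
    "Number of Veterans", "Foreign-born",
]


def clean_demographics(df: pd.DataFrame) -> pd.DataFrame:
    """Cast the head-count columns to nullable integers."""
    out = df.copy()
    for col in COUNT_COLUMNS:
        out[col] = out[col].astype("Int64")
    return out


def population_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where male + female differs from total; incomplete rows are skipped."""
    cols = ["Male Population", "Female Population", "Total Population"]
    complete = df.dropna(subset=cols)
    differs = complete[cols[0]] + complete[cols[1]] != complete[cols[2]]
    return complete[differs]


# First three rows from the sample above, plus one made-up row with a missing value.
sample = pd.DataFrame({
    "City": ["Silver Spring", "Quincy", "Hoover", "Exampletown"],
    "Male Population": [40601.0, 44129.0, 38040.0, None],
    "Female Population": [41862.0, 49500.0, 46799.0, 500.0],
    "Total Population": [82463, 93629, 84839, 1000],
    "Number of Veterans": [1562.0, 4147.0, 4819.0, None],
    "Foreign-born": [30908.0, 32935.0, 8229.0, None],
})
cleaned = clean_demographics(sample)
print(population_mismatches(cleaned))
```

For the three real sample rows the male and female counts add up to the total, so the check returns an empty frame.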


## Airport codes

In [5]:
fname = 'airport-codes_csv.csv'
df_airport_codes = pd.read_csv(fname)
print(df_airport_codes.head(20))

print('total length:', len(df_airport_codes))


   ident           type                                name  elevation_ft  \
0    00A       heliport                   Total Rf Heliport          11.0   
1   00AA  small_airport                Aero B Ranch Airport        3435.0   
2   00AK  small_airport                        Lowell Field         450.0   
3   00AL  small_airport                        Epps Airpark         820.0   
4   00AR         closed  Newport Hospital & Clinic Heliport         237.0   
5   00AS  small_airport                      Fulton Airport        1100.0   
6   00AZ  small_airport                      Cordes Airport        3810.0   
7   00CA  small_airport             Goldstone /Gts/ Airport        3038.0   
8   00CL  small_airport                 Williams Ag Airport          87.0   
9   00CN       heliport     Kitchen Creek Helibase Heliport        3350.0   
10  00CO         closed                          Cass Field        4830.0   
11  00FA  small_airport                 Grass Patch Airport          53.0   

### Airport data column descriptions

| column name| description | type |personal observations for ETL | 
| --- | --- | --- | --- |
| ident | unique identifier of airport | varchar | |
| type | type of airport | varchar | could be enum: heliport, small_airport, medium_airport, closed |
| name | name of airport | varchar | |
| elevation_ft | elevation in feet | float | |
| continent | continent code | varchar | a lot of NaN, maybe skip this column |
| iso_country | ISO country code | varchar | 2 chars |
| iso_region | ISO region code | varchar | pattern XX-XX |
| municipality | municipality | varchar | |
| gps_code | GPS code | short varchar | check for NaN |
| iata_code | IATA code | | as there are a lot of NaN, we will skip this |
| local_code | | | also a lot of NaN, see if it should be skipped |
| coordinates | longitude and latitude | float, float | longitude and latitude as a tuple |
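
Two of the observations above can be handled while loading the file. The many NaN values in `continent` may partly be an artifact: by default `pd.read_csv` treats the string `NA` as missing, and `NA` is also a plausible code for North America. The sketch below disables that default and splits `coordinates` into two float columns. The rows are made up, the function name `read_airports` is illustrative, and the longitude-first order is an assumption taken from the table.

```python
import io

import pandas as pd

# Two made-up rows in the layout of airport-codes_csv.csv (illustrative values only).
CSV_TEXT = """ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
XX01,heliport,Example Heliport,11,NA,US,US-PA,Exampleville,XX01,,XX01,"-74.93, 40.07"
XX02,small_airport,Example Field,,EU,DE,DE-BY,Beispielstadt,XX02,,,"11.5, 48.1"
"""


def read_airports(source) -> pd.DataFrame:
    """Read the airport file, keeping 'NA' as text and splitting the coordinates."""
    # keep_default_na=False stops pandas from reading the string "NA" as missing;
    # na_values=[""] still turns empty cells into NaN.
    df = pd.read_csv(source, keep_default_na=False, na_values=[""])
    # Assumed order: longitude first, then latitude.
    lon_lat = df["coordinates"].str.split(",", expand=True).astype(float)
    df["longitude"] = lon_lat[0]
    df["latitude"] = lon_lat[1]
    return df.drop(columns="coordinates")


airports = read_airports(io.StringIO(CSV_TEXT))
print(airports[["ident", "continent", "elevation_ft", "longitude", "latitude"]])
```

With the real file, `read_airports('airport-codes_csv.csv')` would be the equivalent call. Whether `continent` is still worth skipping can then be judged from the remaining NaN count.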


## Weather

In [9]:
fname = 'GlobalLandTemperaturesByCity.csv'
df_temperature = pd.read_csv(fname)

# keep only U.S. rows, as the immigration data covers only arrivals in the United States

df_temperature_us = df_temperature[df_temperature["Country"] == "United States"]
print(df_temperature_us)
print(len(df_temperature_us))

                 dt  AverageTemperature  AverageTemperatureUncertainty  \
47555    1820-01-01               2.101                          3.217   
47556    1820-02-01               6.926                          2.853   
47557    1820-03-01              10.767                          2.395   
47558    1820-04-01              17.989                          2.202   
47559    1820-05-01              21.809                          2.036   
...             ...                 ...                            ...   
8439242  2013-05-01              15.544                          0.281   
8439243  2013-06-01              20.892                          0.273   
8439244  2013-07-01              24.722                          0.279   
8439245  2013-08-01              21.001                          0.323   
8439246  2013-09-01              17.408                          1.048   

            City        Country Latitude Longitude  
47555    Abilene  United States   32.95N   100.53W  
47556

In [38]:
print('length:', len(df_temperature))
df_temperature.count()

length: 8599212


dt                               8599212
AverageTemperature               8235082
AverageTemperatureUncertainty    8235082
City                             8599212
Country                          8599212
Latitude                         8599212
Longitude                        8599212
dtype: int64

### Weather data column descriptions

| column name| description | type |personal observations for ETL | 
| --- | --- | --- | --- |
|dt | date | string YYYY-MM-DD |  |
|AverageTemperature | avg temp | float| check NAN |
|AverageTemperatureUncertainty| temp uncertainty |float|check NAN, not important for now|
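|City| city name | varchar ||
|Country| country name | varchar | filter on "United States" |
|Latitude| latitude | string, e.g. "32.95N" | remove the "N"/"S" suffix and convert to a signed float, parse to tuple? |
|Longitude| longitude | string, e.g. "100.53W" | remove the "E"/"W" suffix and convert to a signed float, parse to tuple? |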
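
The notes in this table can be sketched as one cleaning step: drop rows without a temperature, parse `dt`, and turn the hemisphere-suffixed coordinates into signed floats. Making south and west negative is a common convention chosen here, not something the notebook specifies, and the function names are illustrative.

```python
import pandas as pd


def parse_coordinate(value: str) -> float:
    """Turn '32.95N' or '100.53W' into a signed float (south and west are negative)."""
    number, hemisphere = float(value[:-1]), value[-1].upper()
    return -number if hemisphere in ("S", "W") else number


def clean_temperature(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a temperature, parse dates and coordinates."""
    out = df.dropna(subset=["AverageTemperature"]).copy()
    out["dt"] = pd.to_datetime(out["dt"], format="%Y-%m-%d")
    out["Latitude"] = out["Latitude"].map(parse_coordinate)
    out["Longitude"] = out["Longitude"].map(parse_coordinate)
    return out


# First row taken from the output above; the second row is made up to show the NaN handling.
sample = pd.DataFrame({
    "dt": ["1820-01-01", "1820-02-01"],
    "AverageTemperature": [2.101, None],
    "AverageTemperatureUncertainty": [3.217, None],
    "City": ["Abilene", "Abilene"],
    "Country": ["United States", "United States"],
    "Latitude": ["32.95N", "32.95N"],
    "Longitude": ["100.53W", "100.53W"],
})
print(clean_temperature(sample))
```

According to the counts above, 8,599,212 - 8,235,082 = 364,130 rows of the full file (about 4.2%) have no `AverageTemperature` and would be dropped by this step.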