YouTube link: https://youtu.be/xf1qqlUgZpk

# Set Up

In [0]:
import numpy as np
import pandas as pd

!pip install -U plotly

import plotly.express as px
import plotly

Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.7.1)


# Background/Overview

Linguists have noticed a phenomenon dubbed the third generation shift: most immigrant groups switch to using English as their primary language by the third generation. We found a dataset that covers three generations of multiple Asian and Hispanic ethnicities and their English proficiency in various U.S. cities.

As we compared general trends, we found that third generation Asian immigrants shift to English more strongly than third generation Hispanic immigrants. Why does this happen?

# Approach

First, we created multiple graphs in order to split up this enormous dataset and look at larger trends, using only the broad populations 'Asian' and 'Hispanic' rather than the individual ethnicities in each group. Once we identified trends, we explored whether there was a geographical reason that could explain the difference between the two groups. Finally, we took a magnifying glass to the data and compared the individual ethnicities to each other using bar graphs.

# Quick Summary

Although Asian immigrants had higher English Only scores than Hispanic immigrants in every generation, a closer look at the individual ethnicities within these broad categories shows that the trends within each group vary greatly.

In conclusion, Asian and Hispanic are broad categories that fit very different and diverse cultures within them. To understand the third generation shift, one must look at individual ethnicities rather than these broad labels.

# Data

In [0]:
#The following 5 datasets originate from this source: https://ccis.ucsd.edu/_files/wp111.pdf
#English proficiency across three generations of multiple Asian and Hispanic ethnicities.
overall_language = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/out%201.csv')
#Asian population in tested cities in 2004. 
area_asian_pop = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/area_vs_asian_population.csv')
#Hispanic population in tested cities in 2004.
area_hispanic_pop = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/area_vs_hispanic_population.csv')
#English proficiency across three generations of Asians. 
area_asian_language = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/asian_generational_language_statistics.csv')
#English proficiency across three generations of Hispanics. 
area_hispanic_language = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/hispanic_generational_language_statistics.csv')

#County FIPS
zip_to_fips = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/ZIP-COUNTY-FIPS_2012-06.csv')

In [0]:
area_asian_language.columns = ['LOCATION', 'PROFICIENCY', 'GENERATION', 'PERCENTAGE']
area_hispanic_language.columns = ['LOCATION', 'PROFICIENCY', 'GENERATION', 'PERCENTAGE']

asian_cities = area_asian_language['LOCATION'].unique()
hispanic_cities = area_hispanic_language['LOCATION'].unique()

combined_cities = set(set(asian_cities) & set(hispanic_cities))

area_asian_language['ETHNICITY'] = "ASIAN"
area_hispanic_language['ETHNICITY'] = "HISPANIC"

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
area_combined_language = pd.concat([area_asian_language, area_hispanic_language], ignore_index=True)
area_combined_language = area_combined_language[area_combined_language['LOCATION'].isin(list(combined_cities))]
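
The combining step above can be sketched on a tiny example. This is a minimal illustration, not the notebook's data: the city names and percentages are made up, and only the column names mirror the real tables. It uses `pd.concat` because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the notebook.
asian = pd.DataFrame({"LOCATION": ["X", "Y"], "PERCENTAGE": [55.0, 40.0]})
hispanic = pd.DataFrame({"LOCATION": ["Y", "Z"], "PERCENTAGE": [30.0, 20.0]})

# Tag each frame before stacking so the group label survives the merge.
asian = asian.assign(ETHNICITY="ASIAN")
hispanic = hispanic.assign(ETHNICITY="HISPANIC")

# Cities that appear in both tables.
shared = set(asian["LOCATION"]) & set(hispanic["LOCATION"])

# Stack the two tables, then keep only the shared cities.
combined = pd.concat([asian, hispanic], ignore_index=True)
combined = combined[combined["LOCATION"].isin(shared)].reset_index(drop=True)
print(combined)
```

Restricting to shared cities means each facet of the later bar chart covers the same locations, so the two groups can be compared city by city.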

# Analysis

In the following analysis, the bar graphs are stacked vertically by generation. Asian ethnicities are represented on the right while Hispanic ethnicities are shown on the left. The x axis of each graph shows the city, and the y axis shows the percentage in each proficiency category: English Well (blue), English Not Well (red), and English Only (green).

In [0]:
area_combined_language
fig = px.bar(area_combined_language, x="LOCATION", y="PERCENTAGE", color='PROFICIENCY', facet_col="ETHNICITY", facet_row="GENERATION", barmode='group',
             height=800)
fig.show()

In the following analysis, second generation Asian and Hispanic language persistence is shown on choropleth maps. Persistence is computed in the code below by summing the English Well and English Not Well percentages and averaging them per state. The top map is the second generation Asian population and the bottom is the second generation Hispanic population.

In [0]:
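# .copy() gives independent frames, so adding columns later does not trigger SettingWithCopyWarning.
second_gen_asians = area_asian_language[area_asian_language["GENERATION"] == 2].copy()
second_gen_hispanics = area_hispanic_language[area_hispanic_language["GENERATION"] == 2].copy()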

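# `search` is never defined in this notebook. The method names used below match
# uszipcode's SearchEngine, so that library is assumed here (pip install uszipcode).
from uszipcode import SearchEngine
search = SearchEngine()

def find_loc(location):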
  if(location.find('-') > -1):
    cityname = location.split('-')[0]
    statename = ""
    if(cityname[-2:].isupper()):
      statename = cityname[-2:]
    elif(location[-5:-3].isupper()):
      statename = location[-5:-3]
    elif(location[-2:].isupper()):
      statename = location[-2:]
    if(len(statename) > 0):
      return search.by_city_and_state(cityname, statename)[0]
    else:
      return search.by_city(cityname)[0]
  else:
    cityname = location[:-2]
    statename = ""
    if(location[-2:].isupper()):
      statename = location[-2:]
    if(len(statename) > 0):
      return search.by_city_and_state(cityname, statename)[0]
    else:
      return search.by_city(cityname)[0]

second_gen_asians['state'] = second_gen_asians['LOCATION'].map(lambda loc: find_loc(loc).state)
second_gen_hispanics['state'] = second_gen_hispanics['LOCATION'].map(lambda loc: find_loc(loc).state)


In [0]:
second_gen_asians['percents'] = second_gen_asians['PERCENTAGE'].astype(float)
second_gen_hispanics['percents'] = second_gen_hispanics['PERCENTAGE'].astype(float)

state_columns = ['state', '2nd gen language persistence in asians', '2nd gen language persistence in hispanics']

astates = second_gen_asians['state'].unique()
hstates = second_gen_hispanics['state'].unique()
astatewise_df = pd.DataFrame(columns=state_columns)
hstatewise_df = pd.DataFrame(columns=state_columns)

for abbr in astates:
  arows = second_gen_asians[second_gen_asians['state'] == abbr]
  apersistence = 0.0
  anum_entries = len(arows.index)
  for i in range (0, len(arows.index)):
    row = arows.iloc[i]
    proficiency = row['PROFICIENCY']
    if(proficiency == 'English Well'):
      apersistence += row['percents']
    elif(proficiency == 'English Not Well'):
      apersistence += row['percents']
  ascaled_persistence = apersistence / (anum_entries / 3)
  astatewise_df = astatewise_df.append({'state': abbr, '2nd gen language persistence in asians': ascaled_persistence}, ignore_index=True)

for abbr in hstates:
  hrows = second_gen_hispanics[second_gen_hispanics['state'] == abbr]
  hpersistence = 0.0
  hnum_entries = len(hrows.index)
  for i in range (0, len(hrows.index)):
    row = hrows.iloc[i]
    proficiency = row['PROFICIENCY']
    if(proficiency == 'English Well'):
      hpersistence += row['percents']
    elif(proficiency == 'English Not Well'):
      hpersistence += row['percents']
  hscaled_persistence = hpersistence / max((hnum_entries / 3), 1)
  hstatewise_df = hstatewise_df.append({'state': abbr, '2nd gen language persistence in hispanics': hscaled_persistence}, ignore_index=True)

fig2 = px.choropleth(astatewise_df, locations='state', color='2nd gen language persistence in asians', color_continuous_scale="Viridis", range_color=[30, 90], scope='usa', locationmode='USA-states')
fig2.show()
fig3 = px.choropleth(hstatewise_df, locations='state', color='2nd gen language persistence in hispanics', color_continuous_scale="Viridis", range_color=[30, 90], scope='usa', locationmode='USA-states')
fig3.show()
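
The two loops above build each state table row by row with `DataFrame.append`, which was removed in pandas 2.0. A minimal sketch of the same idea with `groupby` follows. The helper name `statewise_persistence` and the toy values are hypothetical; the column names follow the notebook. It averages per city rather than dividing by `num_entries / 3`, which gives the same result only under the assumption that every city has exactly three proficiency rows.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as second_gen_asians (values are made up).
toy = pd.DataFrame({
    "LOCATION": ["City A", "City A", "City A", "City B", "City B", "City B"],
    "state": ["CA"] * 6,
    "PROFICIENCY": ["English Only", "English Well", "English Not Well"] * 2,
    "percents": [40.0, 50.0, 10.0, 60.0, 30.0, 10.0],
})

def statewise_persistence(df, value_name):
    """Average, per state, the share of speakers who are not English Only."""
    kept = df[df["PROFICIENCY"].isin(["English Well", "English Not Well"])]
    # Sum the two non-English-Only categories within each city ...
    per_city = kept.groupby(["state", "LOCATION"])["percents"].sum()
    # ... then average the city totals within each state.
    per_state = per_city.groupby(level="state").mean()
    return per_state.rename(value_name).reset_index()

result = statewise_persistence(toy, "persistence")
print(result)
```

The resulting frame has one row per state and could be passed to `px.choropleth` in place of `astatewise_df` or `hstatewise_df`, with `value_name` set to the matching color column.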


In the following analysis, the three bar graphs show individual Asian and Hispanic ethnicities, grouped by generation on the x axis, against percentage on the y axis for each level of English proficiency. The top graph shows English Only speakers, the middle shows English Well, and the last shows English Not Well.

In [0]:
overall_language['numpercent'] = overall_language['Percent'].map(lambda per: float(per))

pruned_overall_language = overall_language[overall_language['numpercent'] != 0]

english_only_ethnicities = pruned_overall_language[pruned_overall_language['Proficiency'] == 'English Only']
english_well_ethnicities = pruned_overall_language[pruned_overall_language['Proficiency'] == 'English Well']
english_not_well_ethnicities = pruned_overall_language[pruned_overall_language['Proficiency'] == 'English Not Well']

fig_eoe = px.bar(english_only_ethnicities, x='Generation', y='numpercent', title="English-exclusive speakers per generation", color='Ethnicity', barmode='group')
fig_eoe.show();
fig_ewe = px.bar(english_well_ethnicities, x='Generation', y='numpercent', title="English-proficient speakers per generation", color='Ethnicity', barmode='group')
fig_ewe.show();
fig_enwe = px.bar(english_not_well_ethnicities, x='Generation', y='numpercent', title="English-not proficient speakers per generation", color='Ethnicity', barmode='group')
fig_enwe.show();
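
Spikes between generations can also be read off numerically rather than by eye. The sketch below is one possible way to do that, assuming a table shaped like `pruned_overall_language` with integer generations. The ethnicity labels and percentages are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical values; only the column names come from overall_language.
toy = pd.DataFrame({
    "Ethnicity": ["A", "A", "A", "B", "B", "B"],
    "Generation": [1, 2, 3, 1, 2, 3],
    "Proficiency": ["English Only"] * 6,
    "numpercent": [10.0, 60.0, 90.0, 5.0, 15.0, 70.0],
})

english_only = toy[toy["Proficiency"] == "English Only"]

# One row per ethnicity, one column per generation.
by_generation = english_only.pivot(index="Ethnicity", columns="Generation", values="numpercent")

# Change from each generation to the next; the first column is NaN by construction.
jumps = by_generation.diff(axis=1)

# Generation in which each ethnicity gains the most English Only speakers.
largest_jump = jumps.idxmax(axis=1)
print(largest_jump)
```

In this toy data, group A makes its largest jump in the second generation and group B in the third, which is the kind of contrast the charts above show between ethnicities.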

# Interpretations and Conclusions

We wanted to understand why third generation Asian immigrants had higher English proficiency than third generation Hispanic immigrants. We found that Asian immigrants had higher English Only scores than Hispanic immigrants in every generation. 

We examined whether geography played a role. When mapping second generation Asian immigrants' mother tongue persistence by state, we found that Wisconsin had high retention compared to other states. On further investigation, we found that Wisconsin has a high Hmong population compared to other states.

In contrast, Hispanic second generation language persistence was relatively homogeneous across states compared to Asian language persistence.

Finally, we examined individual Asian and Hispanic ethnicities and found that they vary greatly from the overall trends.

For example, Japanese immigrants have a spike in English exclusive speakers in their second and third generation, while Vietnamese and Laotians spiked only in the third generation. 

Indian immigrants maintained relatively high levels of English exclusive speakers compared to Hispanic and Asian groups. In contrast, Dominican immigrants maintained relatively low levels of English exclusive speakers compared to Hispanic and Asian groups even at the third generation.

In conclusion, Asian and Hispanic are broad categories that fit very different and diverse cultures within them. To understand the third generation shift, one must look at individual ethnicities rather than these broad labels.

# Future Directions

We found many interesting trends within individual ethnicities in terms of where there are spikes or a lack of spikes in each group. Some paths to explore could include:
- Does foreign conflict affect an immigrant group's shift toward English exclusive language?
  - The spike in English proficiency occurs in second generation Japanese immigrants, which may coincide with World War II.
  - The spike in English proficiency occurs in third generation Vietnamese immigrants, which may coincide with the Vietnam War.
- Why do Dominican immigrant groups not have as strong a third generation shift as other immigrant groups?
- Does population density of individual ethnic groups affect mother tongue language retention?