# Analyzing Cultural Diversity through Census Data of Language Spoken at Home

By Jordan Crawford-O'Banner and Alli Busa

In the news, we've heard that diversity has increased in the United States. While the Census has praised the U.S. for becoming more racially and ethnically diverse overall, FOX newscaster Tucker Carlson has claimed that the rise of immigrants in the U.S. is not happening in "politicians' neighborhoods". (See Project Proposal)

We wanted to take these two clashing statements and put them both to the test. Has cultural diversity, measured through linguistic diversity, increased significantly in the past few years? How is cultural diversity, measured through linguistic diversity, spread throughout the U.S.? Are regions with high ratios of non-English speakers actually linguistically diverse, or do they consist of a single homogeneous non-English-speaking community?

## Importing Necessary Packages and Setting Up Data Imports

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

from thinkstats2 import Pmf, Cdf

import thinkstats2
import thinkplot

In [2]:
#Creating lists of strings to make iterating through excel sheets easier

#State abbreviations
#Taken from https://gist.github.com/JeffPaine/3083347
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "US"]

#State names
#Taken from https://gist.github.com/tleen/6299431
statesfull=['Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut',
            'District of Columbia','Delaware','Florida','Georgia',
            'Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine',
            'Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska',
            'Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota',
            'Ohio','Oklahoma','Oregon','Pennsylvania',
            'Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont',
            'Virginia','Washington','West Virginia','Wisconsin','Wyoming','Puerto Rico']

In [3]:
def createSeries(emptySeries, arrayNames, textfileName, index, Column, skiprows=[0,1,2]):
    """
    Fills a Series with specific language information from an Excel file.
    It is used for organizing data on total English speakers and total non-English speakers.

    emptySeries - an empty Series, which is filled in place with one value per sheet
    arrayNames - either states or statesfull, depending on how the Excel sheets are named
    textfileName - the data file
    index - the row label of the value to read; the calls in this notebook pass 1 for total English speakers and 2 for total non-English speakers
    Column - depending on the year, the column which contains the number of speakers is either "Number of speakers" or "Number of speakers1"
    skiprows - the rows to skip before the data begins, which changes with the Excel file. For the 2006-2008 data, it is [0,1,2]; for the 2009-2013 data, it is [0,1,2,3]
    """
    for i, area in enumerate(arrayNames):

        df = pd.read_excel(textfileName, sheet_name=area, skiprows=skiprows).dropna()

        #dictionary[area] = df.loc[dictionaryindex1:dictionaryindex1,Column]
        #Series.set_value was removed in pandas 1.0; assigning through .loc adds the new label instead
        emptySeries.loc[i] = df.loc[index, Column]


In [5]:
#English and Non-English Speakers for 2006, Creating dictionary and arrays

#totallanguage_dict_06 = dict.fromkeys(states)
english2006=pd.Series()
other2006=pd.Series()
txtfile2006 = "/home/alli/The-Mother-Tongue-of-US-Communities/raw_data/DetailedLanguageSpoken_State_20062008.xls"

createSeries(english2006, states, txtfile2006, 1,"Number of speakers", skiprows=[0,1,2])
    
createSeries(other2006, states, txtfile2006, 2,"Number of speakers", skiprows=[0,1,2])



In [6]:
#English and Non-English Speakers for 2009, Creating dictionary and arrays

#totallanguage_dict_09 = dict.fromkeys(statesfull)
english2009=pd.Series()
other2009=pd.Series()
txtfile2009 =  "/home/alli/The-Mother-Tongue-of-US-Communities/raw_data/LanguageSpokenatHome_State_2009-2013.xls"

#things that are different about this file : skiprows is [0,1,2,3], column is "Number of speakers1" and the sheet names come from statesfull
createSeries(english2009, statesfull, txtfile2009, 1,"Number of speakers1", skiprows=[0,1,2,3])
    
createSeries(other2009, statesfull, txtfile2009, 2, "Number of speakers1", skiprows=[0,1,2,3])



## Initial Analysis - Probability Mass Functions (PMFs) 
We will begin by plotting PMFs to try to answer "Has cultural diversity, measured through linguistic diversity, increased significantly in the past few years? How is cultural diversity, through linguistic diversity, spread out throughout the U.S.? "
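
As a minimal sketch of the normalization behind these plots, the snippet below turns raw speaker counts into shares of each state's counted population. It uses plain Python so it runs without the Excel files; the counts are made-up placeholders rather than Census figures, and `to_shares` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
def to_shares(english_only, other):
    """Return (english_share, other_share) for one state; the two shares sum to 1."""
    total = english_only + other
    if total == 0:
        raise ValueError("state has no recorded speakers")
    return english_only / total, other / total

# Hypothetical counts for two placeholder states (not Census data)
counts = {"AA": (900, 100), "BB": (600, 400)}

shares = {state: to_shares(english, other) for state, (english, other) in counts.items()}
for state, (english_share, other_share) in shares.items():
    print(f"{state}: English only {english_share:.2f}, other languages {other_share:.2f}")
```

Because each state is divided by its own total, a populous state and a small state become directly comparable, which raw counts would not allow.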

In [7]:
def plotPMFs(english_speakers, other_speakers):
    """
    Takes the number of only English speakers and number of speakers of other languages, per census file. 
    Draws two bar charts: the share of only English speakers and the share of speakers of 
    other languages, each normalized by the total population of the state. 
    """
    total_speakers = english_speakers+other_speakers
    english_speakers_normalized = english_speakers/total_speakers
    other_speakers_normalized = other_speakers/total_speakers

    #range(0,50) covers the first 50 entries of states (AL through WI); "WY" and the "US" total are not plotted
    plt.figure(figsize=(20, 3))
    for i in range(0,50): 
        plt.bar(i, english_speakers_normalized[i])
    #labels and ticks only need to be set once per figure
    plt.autoscale(enable=True)
    plt.xticks(range(0,50), states[0:50])
    plt.title("PMF of MonoLingual English Speakers")
    plt.xlabel("State")
    plt.ylabel("Normalized Probability of MonoLingual English Speakers")

    plt.figure(figsize=(20, 3))
    for i in range(0,50):
        plt.bar(i, other_speakers_normalized[i])
    plt.autoscale(enable=True)
    plt.ylim([0.0,1.0])
    plt.xticks(range(0,50), states[0:50])
    plt.title("PMF of Speakers of Non-English Languages")
    plt.xlabel("State")
    plt.ylabel("Normalized Probability of Speakers of Non-English Languages")

In [8]:
def plotLineGraph(otherspeakers2000, otherspeakers2006, otherspeakers2009, state):
    """
    Takes the information about non-English speakers in the 2000, 2006-2008 and 2009-2013 periods as well
    as which state you want to analyze.
    Returns a line graph which displays number of total speakers as a function of time
    Also returns a string of the state name, so that it is easier to plot the legend 
    """
    
    indexstate = states.index(state) #finds the index of the state, in order to input into the series
    years = [2000,2006,2009]  #the initial years of the census data which we have 
    state_num = [otherspeakers2000[indexstate],otherspeakers2006[indexstate], otherspeakers2009[indexstate]] # the output
    plt.plot(years, [x / 1000 for x in state_num], '-o', label = str(state)) #dividing by 1000 in order to make numbers smaller
    plt.title("Number of Speakers of Non-English Languages in the 2000-2009 Time Frame")
    plt.xlabel("Year")
    plt.ylabel("non-English Speakers (thousands)")
    return str(state)

In [10]:
def returnStatesfromRatio(english, other, percent, sign):
    """
    Takes in a pair of total non-English speakers and total monolingual English speakers, as well as desired ratio and comparison
    Prints the abbreviations of the states which have a percentage of non-English speakers
    which is less than or more than the percentage specified. 

    Percent - Give percent in range (0:1)
    Sign - "Less" or "More"
    """
    othernorm = (other[0:50]/(english[0:50]+other[0:50])) #creating a normalized ratio
    #compare the ratio computed from the arguments, not a global from another cell
    if sign == "Less":
        result = othernorm < percent
    elif sign == "More":
        result = othernorm > percent
    else:
        raise ValueError('sign must be "Less" or "More"')
        
    statesofinterest=[i for i, x in enumerate(result) if x]
    for i in statesofinterest:
        print(states[i])

In [None]:
plotPMFs(english2000, other2000)

In [None]:
plotPMFs(english2006, other2006 )

These graphs show on top the portion of the population for each state that only speaks English, and on the bottom the portion of the population for each state that speaks a language other than English. 

We see that in every U.S. state a large portion of the population speaks only English. California also stands out with a large percentage of speakers of non-English languages.

In this year, which states have the most speakers of non-English languages and which have the fewest?
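
To make the threshold comparison concrete, here is a small self-contained sketch of the same idea in plain Python. The shares are invented placeholders, and `states_by_threshold` is a hypothetical stand-in written for this illustration; it returns the matching labels instead of printing them.

```python
def states_by_threshold(other_share, threshold, sign):
    """Return the states whose share is below ("Less") or above ("More") the threshold."""
    if sign == "Less":
        return [state for state, share in other_share.items() if share < threshold]
    if sign == "More":
        return [state for state, share in other_share.items() if share > threshold]
    raise ValueError('sign must be "Less" or "More"')

# Hypothetical shares of speakers of other languages (not Census data)
other_share = {"AA": 0.05, "BB": 0.25, "CC": 0.15}

more_than_20 = states_by_threshold(other_share, 0.2, "More")
less_than_10 = states_by_threshold(other_share, 0.1, "Less")
print(more_than_20)
print(less_than_10)
```

Returning a list rather than printing makes it easier to reuse the result, for example to decide which states to examine by county.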

In [None]:
returnStatesfromRatio(english2006, other2006, 0.2, "More")

In [None]:
returnStatesfromRatio(english2006, other2006, 0.1, "Less")

In [None]:
#Returning which state had the smallest share of non-English speakers, and what that share was
othernorm2006 = (other2006[0:50]/(english2006[0:50]+other2006[0:50])) 
print(states[othernorm2006.idxmin()]) #State in 2006 with the smallest share of non-English speakers
print(othernorm2006.min())

In [None]:
plotPMFs(english2009, other2009)

## Number of Non-English Speakers per County in Potentially Interesting States

Next we will look at the counties of states which had some of the most extreme ratios of English-only speakers to speakers of other languages. We do this to answer "How is cultural diversity, measured through linguistic diversity, spread throughout the U.S.?" on a smaller scale.
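
One way to shortlist such states is to rank them by their share of speakers of other languages and take both ends of the ranking. The sketch below does this in plain Python with made-up shares; `extreme_states` is a hypothetical helper, and the states actually examined in this notebook were not necessarily chosen this way.

```python
def extreme_states(other_share, n):
    """Return (lowest, highest): the n states with the smallest and the largest shares."""
    ranked = sorted(other_share, key=other_share.get)  # ascending by share
    return ranked[:n], ranked[::-1][:n]

# Hypothetical shares of speakers of other languages (not Census data)
other_share = {"AA": 0.05, "BB": 0.40, "CC": 0.15, "DD": 0.02}

lowest, highest = extreme_states(other_share, 2)
print("lowest:", lowest)
print("highest:", highest)
```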

In [12]:
def createCountySeries(countyNumbers, countyList, state):
    """
    Takes in two empty series and a state, and fills both series in place.
    countyNumbers - an empty series which will be populated with the total speakers of other languages per county
    countyList - an empty series which will be populated with the names of the county sheets in the state
    state - state you wish to look at, given as ex: ["AZ"]
    index - the row number in which t
    """
    textfileName = "/home/alli/The-Mother-Tongue-of-US-Communities/raw_data/LanguageSpokenatHome_County_2009-2013.xls"
    i=0
    x1 = pd.ExcelFile(textfileName)
    sheetnames = x1.sheet_names
    for sheet in sheetnames:
        if any(x in sheet for x in state):
            df = pd.read_excel(textfileName, sheet_name=sheet, skiprows=[0,1,2,3]).dropna()
            #dictionary[area] = df.loc[1:2,"Number of speakers"]
            #Series.set_value was removed in pandas 1.0; assigning through .loc adds the new label instead
            countyNumbers.loc[i] = df["Number of speakers1"][1]
            countyList.loc[i] = sheet
            i+=1

In [13]:
def plotCounty(statename, countylist, numberpercounty):
    """
    Takes in the name of a state, and the two series which createCountySeries filled
    Plots the number of speakers of other languages by county
    """
    # Determines size of plot: a wide figure when there are more than 5 counties
    if len(numberpercounty)>5:
        plt.figure(figsize=(20, 3))
    else:
        plt.figure()
        
    for i in range(0,len(countylist)): 
        plt.bar(i, numberpercounty[i])
    # Labels and ticks only need to be set once per figure
    plt.autoscale(enable=True)
    plt.xticks(range(0,len(countylist)), countylist[0:len(countylist)],  rotation=70)
    plt.title("Other Language Speakers in "+str(statename)+" by County")
    plt.xlabel("County")
    plt.ylabel("Number of Other Language Speakers")

### California Counties
California contained the greatest percentage of non-English speakers. Do all of its counties contribute equally to this statistic?
The states examined by county in this section are CA, WA, PA, NY and MI.

In [None]:
#creating series and calculating non-English speakers per county
CAcounty = pd.Series()
otherspeakers_CAcounties = pd.Series()
createCountySeries(otherspeakers_CAcounties, CAcounty,["CA"])

In [None]:
plotCounty("California", CAcounty, otherspeakers_CAcounties)

It looks like the answer is no. Los Angeles has substantially more non-English speakers than any other California county.

### New York Counties

New York is known as the first place to which many immigrants come. Let's see how many more speakers of non-English languages New York, New York has than upstate New York counties.

In [None]:
#creating series and calculating non-English speakers per county
NYcounty = pd.Series()
otherspeakers_NYcounties = pd.Series()
createCountySeries(otherspeakers_NYcounties, NYcounty,["NY"])

In [None]:
plotCounty("New York", NYcounty, otherspeakers_NYcounties)

We were not expecting the non-English speakers in New York to be spread across this many counties. However, New York County, Kings County, Queens County and Bronx County all belong to the city of New York and form one continuous urban area. If we group them together, we would expect to find that New York City contributes a large share of the non-English speakers in New York State.
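
The grouping described here can be sketched as follows. This is an illustration only: the county counts are invented placeholders, the plain county names are an assumption (the sheet names in the Excel file may be formatted differently), and `grouped_share` is a hypothetical helper.

```python
# The four counties named in the text above. Richmond County (Staten Island)
# is also part of New York City but is not named there, so it is left out.
NYC_COUNTIES = {"New York", "Kings", "Queens", "Bronx"}

def grouped_share(counts, group):
    """Return the fraction of all speakers that live in the counties listed in group."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no speakers recorded")
    in_group = sum(n for county, n in counts.items() if county in group)
    return in_group / total

# Hypothetical speakers of other languages per county (not Census data)
counts = {"New York": 500, "Kings": 900, "Queens": 1000, "Bronx": 600,
          "Albany": 50, "Erie": 150}

nyc_share = grouped_share(counts, NYC_COUNTIES)
print(f"Share of the state's non-English speakers in the grouped counties: {nyc_share:.1%}")
```

With the real county series in place of `counts`, the same calculation would show whether the grouped counties account for most of the state total.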

### Washington Counties

In [None]:
#creating series and calculating non-English speakers per county
WAcounty = pd.Series()
otherspeakers_WAcounties = pd.Series()
createCountySeries(otherspeakers_WAcounties, WAcounty,["WA"])

In [None]:
plotCounty("Washington", WAcounty, otherspeakers_WAcounties)

### Pennsylvania Counties

In [None]:
#creating series and calculating non-English speakers per county
PAcounty = pd.Series()
otherspeakers_PAcounties = pd.Series()
createCountySeries(otherspeakers_PAcounties, PAcounty,["PA"])

In [None]:
plotCounty("Pennsylvania", PAcounty, otherspeakers_PAcounties)

### Michigan Counties

In [None]:
#creating series and calculating non-English speakers per county
MIcounty = pd.Series()
otherspeakers_MIcounties = pd.Series()
createCountySeries(otherspeakers_MIcounties, MIcounty,["MI"])

In [None]:
plotCounty("Michigan", MIcounty, otherspeakers_MIcounties)

## Analyzing Ratios of Different Languages Spoken in Each Region

How diverse are the most linguistically diverse communities, actually? We posit that if all the different languages spoken in a community are, for example, European languages, that changes the perception of how diverse the community is. In this way, we hope to answer the question "Are regions with high ratios of non-English speakers actually linguistically diverse, or do they consist of a single homogeneous non-English-speaking community?"
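
One simple way to frame this question is to convert the per-group counts for a region into shares and look at the largest one: a region dominated by a single language group is closer to homogeneous than one with evenly split groups. The sketch below illustrates this with invented counts. The helper names `group_shares` and `dominant_group` are hypothetical, and this is one possible reading of "diverse", not necessarily the measure used later in this analysis.

```python
def group_shares(group_counts):
    """Return each language group's share of the region's counted speakers."""
    total = sum(group_counts.values())
    if total == 0:
        raise ValueError("region has no recorded speakers")
    return {group: n / total for group, n in group_counts.items()}

def dominant_group(group_counts):
    """Return (group, share) for the language group with the largest share."""
    shares = group_shares(group_counts)
    top = max(shares, key=shares.get)
    return top, shares[top]

# Hypothetical counts per language group for two placeholder regions (not Census data)
region_a = {"Spanish": 800, "Asian and Pacific Island": 100,
            "German, Scandinavian, Italian": 50, "Native North American": 50}
region_b = {"Spanish": 250, "Asian and Pacific Island": 250,
            "German, Scandinavian, Italian": 250, "Native North American": 250}

print(dominant_group(region_a))  # one group holds most of the speakers
print(dominant_group(region_b))  # speakers are split evenly across groups
```

For a pandas DataFrame with one row per region, the equivalent normalization would be `df.div(df.sum(axis=1), axis=0)`.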

In [None]:
def createListofLanguageGroups(textfileName, arrayNames, column1, skiprows):
    """
    Takes in a census data file and returns a dataframe which contains language groups and the corresponding
    number of respondents who speak a language in the language group
    
    textfileName - name of file
    arrayNames - either states or statesfull, depending on Excel file's sheet names
    column1 - because on 2006 the column is "Number of speakers" but on 2009, it's "Number of speakers1"
    skiprows - also depending on dataframe, we either skip [0,1,2] rows or [0,1,2,3] rows
    """
    #Accounting for different formatting of language names in different excel files
    if textfileName == txtfile2009:
        languagegroupstolookfor = ["SPANISH AND SPANISH CREOLE", ".Italian","..German", ".Scandinavian languages",
                                  "ASIAN AND PACIFIC ISLAND LANGUAGES",".Navajo", ".Other Native North American languages"]
    elif textfileName == txtfile2006:
        languagegroupstolookfor = ["\nSPANISH AND SPANISH CREOLE", ".Italian",".German", ".Scandinavian languages",
                                  "\nASIAN AND PACIFIC ISLAND LANGUAGES",".Navajo", ".Other Native North American languages"]
    else:
        raise ValueError("textfileName must be txtfile2006 or txtfile2009")
    
    #Creating an empty dataframe 
    languagesperState = pd.DataFrame(np.nan, index=np.array(arrayNames), columns=np.array(languagegroupstolookfor)) #row index will be state and column index will be language group

    #Looping through the sheets and if the language is there, add it
    for area in arrayNames:

        df = pd.read_excel(textfileName, sheet_name=area, skiprows=skiprows).dropna()
        
        for language in languagegroupstolookfor:

            try:
                num = df[df['Unnamed: 0']==language][column1].item() #select row from languages column which has certain language
            except (KeyError, ValueError): #the language is missing from this sheet
                num = 0
            languagesperState.loc[area, language] = num
    
    #Putting Some Languages into Categories and Deleting the Individual Ones
    languagesperState['German, Scandinavian, Italian'] = languagesperState.iloc[:, 1:4].sum(axis=1)
    languagesperState['Native North American'] = languagesperState.iloc[:, 5:7].sum(axis=1)
    #Take the names from the list, because the 2006 file spells German as ".German" and the 2009 file as "..German"
    individuallanguages = languagegroupstolookfor[1:4] + languagegroupstolookfor[5:7]
    languagesperState.drop(individuallanguages, axis = 1, inplace = True) 
 
    return languagesperState

In [None]:
def createCountiesList(statename):
    """
    Creates a list of county sheet names from the 2009-2013 counties file for the state you want to analyze
    statename - state you wish to look at, given as a list, ex: ["AZ"]
    """
    textfileName =  "/home/alli/The-Mother-Tongue-of-US-Communities/raw_data/LanguageSpokenatHome_County_2009-2013.xls"
    countyNames = []
    x1 = pd.ExcelFile(textfileName)
    sheetnames = x1.sheet_names                        
    for sheet in sheetnames:
        if any(x in sheet for x in statename):
            
            countyNames.append(str(sheet))

    return countyNames

In [None]:
def createListofLanguageGroupsCounties(statename):
    """
    Takes in a state, given as a list, ex: ["AZ"], and returns a dataframe which contains, for each county of that state
    in the 2009-2013 county file, the language groups and corresponding number of respondents
    who speak a language in the language group
    
    """
    #Reading Excel spreadsheet
    textfileName =  "/home/alli/The-Mother-Tongue-of-US-Communities/raw_data/LanguageSpokenatHome_County_2009-2013.xls"
    #Initializing languages which we are looking for
    languagegroupstolookfor = ["SPANISH AND SPANISH CREOLE", ".Italian","..German", ".Scandinavian languages",
                                 "ASIAN AND PACIFIC ISLAND LANGUAGES",".Navajo", ".Other Native North American languages"]
    #Initializing counties list 
    countyNames = createCountiesList(statename)
    #Create empty dataframe
    languagesperState = pd.DataFrame(np.nan, index= np.array(countyNames), columns=np.array(languagegroupstolookfor)) #row index will be county and column index will be language group
    #Looping through the sheets and if the language and county are there, add it
    x1 = pd.ExcelFile(textfileName)
    sheetnames = x1.sheet_names                        
    for sheet in sheetnames:

        if sheet in countyNames:

            df = pd.read_excel(textfileName, sheet_name=sheet, skiprows=[0,1,2,3]).dropna()
            
            for language in languagegroupstolookfor:
                
                try:
                    num = df[df['Unnamed: 0']==language]["Number of speakers1"].item() #select row from languages column which has certain language
                    
                except (KeyError, ValueError): #the language is missing from this sheet
                    num = 0
                
                languagesperState.loc[sheet, language] = num
                
    #Putting Some Languages into Categories and Deleting the Individual Ones
    languagesperState['German, Scandinavian, Italian'] = languagesperState.iloc[:, 1:4].sum(axis=1)
    languagesperState['Native North American'] = languagesperState.iloc[:, 5:7].sum(axis=1)
    languagesperState.drop([".Italian","..German", ".Scandinavian languages",".Navajo", ".Other Native North American languages"], axis = 1, inplace = True) 
 

    return languagesperState