<a href="https://colab.research.google.com/github/ContextLab/storytelling-with-data/blob/master/data-stories/KoreanAmerican/losing_your_language_is_sad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Set Up

In [0]:
from textblob import TextBlob
import textwrap
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff

### Project Team

Jae Hong - Collected the datasets and helped find trends in the data.

Maddy Lee - Assisted with data collection, helped find trends, and completed the documentation.

Mira Ram - Helped find trends and created the movie.

Bill Tang - Created graphs and helped find trends.

### Background and Overview

In our Language Shift project, we found that by the third generation, most immigrant groups lose the ability to speak their mother tongue. Here we focus specifically on Korean immigrants to see how they feel about this loss of language.

### Approach

We collected data through the Facebook group 'Subtle Korean Traits' and ran a sentiment analysis on the long-answer responses to see how respondents felt. We also analyzed their self-reported language proficiency and, for those who felt embarrassed, what made them feel that way. As a comparison, we collected data from American, non-immigrant Dartmouth students on their sentiment toward English.

### Quick Summary

We found that Korean American sentiment was lower than non-immigrant American sentiment. People with basic proficiency in Korean reported more reasons for embarrassment about their use of the language than people at other proficiency levels. Korean Americans with conversational proficiency had the highest average polarity.

### Data

In [0]:
subtle_korean_feelings = pd.read_csv('https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/subtle-korean-feelings-remastered.csv')
American_feelings = pd.read_csv("https://raw.githubusercontent.com/jaeshong/bob-the-weasel/master/American%20Language%20and%20Identity.csv")

### Analysis

In the following analysis, we compute the sentiment polarity of the Korean American and American responses and plot each group's average polarity as a bar graph.

The second and third graphs show the distribution of sentiment polarity for the Korean American and American responses, respectively, drawn as density curves rather than binned histograms.

In [0]:
# First, check whether overall sentiment about language is positive or negative for Korean Americans and for Americans, and how the two groups compare.
tot = 0.0
num = 0
atot = 0.0
anum = 0

def com_pol(comment):
  blob = TextBlob(comment)
  return blob.sentiment.polarity

def com_sub(comment):
  blob = TextBlob(comment)
  return blob.sentiment.subjectivity

def proficiency(row):
  if(row['Nonexistent'] == 1):
    return 'Nonexistent'
  elif(row['basic'] == 1):
    return 'basic'
  elif(row['conversational'] == 1):
    return 'conversational'
  elif(row['fluent'] == 1):
    return 'fluent'
  else:
    return 'Nonexistent'

def nproficiency(row):
  if(row['Nonexistent'] == 1):
    return 0
  elif(row['basic'] == 1):
    return 1
  elif(row['conversational'] == 1):
    return 2
  elif(row['fluent'] == 1):
    return 3
  else:
    return 0


subtle_korean_feelings['Polarity'] = subtle_korean_feelings['Comment'].map(com_pol)
American_feelings['Polarity'] = American_feelings['Comment'].map(com_pol)

subtle_korean_feelings['Subjectivity'] = subtle_korean_feelings['Comment'].map(com_sub)
American_feelings['Subjectivity'] = American_feelings['Comment'].map(com_sub)

# Decode the one-hot proficiency columns into a text label and a 0-3 numeric level.
subtle_korean_feelings['Proficiency'] = subtle_korean_feelings.apply(proficiency, axis=1)
subtle_korean_feelings['nProficiency'] = subtle_korean_feelings.apply(nproficiency, axis=1)

# Sum the polarity scores computed above instead of re-running TextBlob on every comment.
for polarity in subtle_korean_feelings['Polarity']:
    tot += polarity
    num += 1

# Mean polarity of the Korean American responses.
print(tot/num)
for polarity in American_feelings['Polarity']:
    atot += polarity
    anum += 1

df = pd.DataFrame([['Koreans', tot/num], ['Americans', atot/anum]], columns=['Group', 'Polarity'])
fig = px.bar(df, x='Group', y='Polarity')
fig.show()

fig2 = ff.create_distplot([subtle_korean_feelings['Polarity']], ['Korean Polarity'], show_hist=False)
fig2.show()
fig3 = ff.create_distplot([American_feelings['Polarity']], ['American Polarity'], show_hist=False)
fig3.show()

0.1003828372495389
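
The `proficiency` and `nproficiency` helpers in the cell above apply the same one-hot decoding twice. As a minimal sketch, a single helper could return both the label and the numeric level. The column names come from the dataset as used above; `LEVELS`, `decode_proficiency`, and the sample rows are hypothetical names and data introduced only for illustration.

```python
# Column names as used in the notebook, ordered from lowest to highest level.
LEVELS = ['Nonexistent', 'basic', 'conversational', 'fluent']

def decode_proficiency(row):
    """Return (label, numeric level) for a one-hot encoded row.

    Falls back to ('Nonexistent', 0) when no column is set, matching the
    behaviour of the original helpers.
    """
    for level, name in enumerate(LEVELS):
        if row.get(name) == 1:
            return name, level
    return LEVELS[0], 0

# Hypothetical rows, not taken from the collected responses.
sample_rows = [
    {'Nonexistent': 0, 'basic': 1, 'conversational': 0, 'fluent': 0},
    {'Nonexistent': 0, 'basic': 0, 'conversational': 0, 'fluent': 1},
    {'Nonexistent': 0, 'basic': 0, 'conversational': 0, 'fluent': 0},
]
decoded = [decode_proficiency(r) for r in sample_rows]
print(decoded)
```

Because the helper only calls `row.get(...)`, it should work on plain dictionaries as well as on the rows that `DataFrame.apply(..., axis=1)` passes in.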


In this analysis, we graph polarity against Korean Americans' self-reported Korean language proficiency.

Proficiency scale: 0 - nonexistent, 1 - basic, 2 - conversational, 3 - fluent

In [0]:
# do people with higher polarity have higher faith in their proficiency?
fig4 = px.scatter(subtle_korean_feelings, x='nProficiency', y='Polarity', title="Proficiency versus Polarity", trendline="ols")
fig4.show()

avg_prof = [0, 0, 0, 0]
count = [0, 0, 0, 0]

for index, row in subtle_korean_feelings.iterrows():
    prof = row['nProficiency']
    avg_prof[prof] += row['Polarity']
    count[prof] += 1

for i in range(0, 4):
  avg_prof[i] = avg_prof[i] / count[i]

df2 = pd.DataFrame([[0, avg_prof[0]], [1, avg_prof[1]], [2, avg_prof[2]], [3, avg_prof[3]]], columns=['proficiency', 'average polarity'])
fig9 = px.line(df2, x='proficiency', y='average polarity', title="Average Polarity by proficiency")
fig9.show()

fig5 = ff.create_distplot([subtle_korean_feelings.loc[subtle_korean_feelings['nProficiency'] == 0]['Polarity']], ['Sentiment for no proficiency'], show_hist=False)
fig5.show()
fig6 = ff.create_distplot([subtle_korean_feelings.loc[subtle_korean_feelings['nProficiency'] == 1]['Polarity']], ['Sentiment for low proficiency'], show_hist=False)
fig6.show()
fig7 = ff.create_distplot([subtle_korean_feelings.loc[subtle_korean_feelings['nProficiency'] == 2]['Polarity']], ['Sentiment for medium proficiency'], show_hist=False)
fig7.show()
fig8 = ff.create_distplot([subtle_korean_feelings.loc[subtle_korean_feelings['nProficiency'] == 3]['Polarity']], ['Sentiment for high proficiency'], show_hist=False)
fig8.show()
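
The per-level averaging loop above divides by `count[i]`, which would fail if a proficiency level had no responses. The following is a minimal sketch of the same averaging that skips empty levels. The function name `mean_polarity_by_level` and the sample pairs are hypothetical and are not taken from the collected data.

```python
from collections import defaultdict

def mean_polarity_by_level(records):
    """Average polarity per proficiency level, skipping levels with no responses."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, polarity in records:
        totals[level] += polarity
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in sorted(counts)}

# Hypothetical (nProficiency, Polarity) pairs, for illustration only.
sample = [(1, 0.5), (1, -0.5), (2, 0.25), (2, 0.75), (3, 0.0)]
means = mean_polarity_by_level(sample)
print(means)
```

With the notebook's DataFrame, `subtle_korean_feelings.groupby('nProficiency')['Polarity'].mean()` should give the same per-level averages in one line.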




In the following analysis, we created bar graphs of the insecurities that Korean Americans reported about their language abilities, first in total and then broken down by proficiency level.

In [0]:
# What are the common reasons people feel bad about their speaking skills, and do different proficiency levels struggle with different problems?

problems = subtle_korean_feelings.columns[8:18]
pr_idx = {}
for i in range(0, 10):
  pr_idx[i+8] = problems[i]
  pr_idx[problems[i]] = i + 8

prob_prof = {}

frequency = [0] * 10
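for index, row in subtle_korean_feelings.iterrows():
    # Look up the label once per row; a separate name avoids shadowing the proficiency() helper.
    prof_label = row['Proficiency']
    for i in range(8, 18):
      # Columns 8-17 hold the 0/1 insecurity flags; index them by position.
      if(row.iloc[i] == 1):
        frequency[i-8] += 1
        try:
          prob_prof[(pr_idx[i], prof_label)] += 1
        except KeyError:
          prob_prof[(pr_idx[i], prof_label)] = 1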
probfreqdata = [[pr_idx[i+8], frequency[i]] for i in range(0, 10)]

probfreqbyprof = []
for k, v in prob_prof.items():
  probfreqbyprof.append([k[0], k[1], v])

df4 = pd.DataFrame(probfreqdata, columns=['Problem', 'Frequency'])
fig10 = px.bar(df4, x='Problem', y='Frequency', title="Total recorded insecurities")
fig10.show()

df3 = pd.DataFrame(probfreqbyprof, columns=['Problem', 'Proficiency', 'Frequency'])
fig11 = px.bar(df3, x='Problem', y='Frequency', color='Proficiency', barmode='group', category_orders={'Proficiency':['Nonexistent', 'basic', 'conversational', 'fluent']})
fig11.show()
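
The counting in the cell above relies on fixed column positions and a `try`/`except` to initialize dictionary entries. As a hedged alternative, the same tallies can be sketched with `collections.Counter` keyed by column name. The function `count_problems`, the problem column names, and the sample rows below are hypothetical and only illustrate the idea.

```python
from collections import Counter

def count_problems(rows, problem_columns):
    """Count how often each (problem, proficiency) pair is flagged with a 1."""
    counts = Counter()
    for row in rows:
        for problem in problem_columns:
            if row.get(problem) == 1:
                counts[(problem, row['Proficiency'])] += 1
    return counts

# Hypothetical column names and rows, for illustration only.
problem_columns = ['accent', 'vocabulary']
rows = [
    {'Proficiency': 'basic', 'accent': 1, 'vocabulary': 1},
    {'Proficiency': 'basic', 'accent': 1, 'vocabulary': 0},
    {'Proficiency': 'fluent', 'accent': 0, 'vocabulary': 1},
]
by_level = count_problems(rows, problem_columns)

# Collapse the per-level counts into a total per problem.
totals = Counter()
for (problem, _level), n in by_level.items():
    totals[problem] += n
print(by_level)
print(totals)
```

A `Counter` returns 0 for missing keys, so no explicit initialization is needed, and naming the columns keeps the counts correct if the column order changes.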

### Interpretations and Conclusions

We found that Korean American sentiment was lower than non-immigrant American sentiment. Perhaps this is due to the fact that many respondents were not fluent in their mother tongue.

The average sentiment was highest for the conversational proficiency group. This was surprising, as we expected the fluent group to have the highest polarity. Upon further investigation, we also noticed that the fluent group reported more embarrassment about the cultural gap than the conversational group did. Perhaps their fluency makes them more aware of the gap between the two cultures than those who are only conversational.

People with basic proficiency in Korean reported more reasons for embarrassment about their use of the language than people at other proficiency levels. This could be due to their struggle to hold conversations with other speakers despite being able to understand a bit.

### Future Directions

- Collect more data on Korean American sentiment toward language and culture.
- Use a different sentiment analysis method. For example, some people used swear words in a positive light, but the analysis we used automatically rated swear words negatively.
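
To illustrate the second point, the sketch below uses a toy word-averaging scorer to show how a single negatively scored swear word can pull down an otherwise positive comment, and how a custom override changes the result. This is not TextBlob's algorithm; the lexicon, its scores, the function `mean_word_score`, and the sample comment are all made up for illustration.

```python
# Toy word-level scores in [-1, 1]; every entry here is hypothetical.
BASE_LEXICON = {'love': 0.5, 'sad': -0.5, 'damn': -0.5}

def mean_word_score(text, lexicon):
    """Average the scores of the words in `text` that appear in `lexicon`."""
    words = text.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

comment = 'damn i love speaking korean'
default_score = mean_word_score(comment, BASE_LEXICON)

# Treat the swear word as neutral instead of negative (a hypothetical adjustment).
adjusted_lexicon = {**BASE_LEXICON, 'damn': 0.0}
adjusted_score = mean_word_score(comment, adjusted_lexicon)
print(default_score, adjusted_score)
```

In this toy setup the comment moves from neutral to positive once the swear word stops counting against it, which is the kind of shift a context-aware method would aim to capture.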