# Project 3 - Japanese or Korean?
This project aims to help prospective learners of a foreign language decide which language suits them better. The scope is narrowed to Japanese and Korean, two languages that are similar on many fronts. See this [Wikipedia page](https://en.wikipedia.org/wiki/Comparison_of_Japanese_and_Korean) for a comparison.

The project is successful on two counts: the data analysis yields insights that support recommendations to potential learners, and 2 of the 3 machine learning models correctly classified more than 75% of unseen posts, a clear improvement over the baseline accuracy of 50.5%.

The project is divided into 3 Python notebooks:
### (1) Web Scraping
The source materials are posts on 2 subreddits on learning Japanese and learning Korean:
- [LearnJapanese](https://www.reddit.com/r/LearnJapanese/new)
- [korean](https://www.reddit.com/r/korean/new)

1000 posts from each subreddit are scraped and processed.

### (2) Data Analysis
The posts on the 2 subreddits are analyzed for similarities and differences. The differences in particular are examined for insights that can help potential learners make their choice.

### (3) Classification by Machine Learning
Different machine learning algorithms are used to classify the posts into Japanese or Korean, to determine if the posts are sufficiently different to be accurately classified. The words that help the most with classifying the posts will also be analyzed.

# (1) Web Scraping

In [1]:
# imports
import pandas as pd
import requests
import re

# set options for pandas DataFrame display
pd.set_option('display.max_colwidth', 50)
pd.set_option('display.max_rows', 20)

In [2]:
# function to scrape 1000 posts from a subreddit, 100 posts per request
url = "https://api.pushshift.io/reddit/search/submission"
cols = ['subreddit', 'selftext', 'title', 'created_utc']

def get_1000_posts(subreddit):
    after = 1609459200   # start datetime 2021-01-01 00:00:00 GMT
    pages = []

    for _ in range(10):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'sort': 'asc',
            'after': after
        }
        res = requests.get(url, params=params)
        res.raise_for_status()   # stop early on an HTTP error
        df = pd.DataFrame(res.json()['data'])
        if df.empty:             # no more posts to fetch
            break
        pages.append(df[cols])
        # continue from the last post returned, even if fewer than 100 came back
        after = df['created_utc'].iloc[-1]

    return pd.concat(pages, ignore_index=True)
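The paging idea behind `get_1000_posts` (ask for posts newer than the last one seen, then repeat) can be tried without network access. The sketch below is only an illustration: `collect_posts`, `fake_fetch_page` and `FAKE_POSTS` are hypothetical names that stand in for the API call and are not part of the notebook or of Pushshift.

```python
def collect_posts(fetch_page, start_after, target=1000):
    """Collect up to `target` posts by repeatedly requesting posts newer than the last one seen."""
    posts = []
    after = start_after
    while len(posts) < target:
        page = fetch_page(after)
        if not page:                       # nothing newer is available
            break
        posts.extend(page)
        after = page[-1]['created_utc']    # cursor: timestamp of the newest post so far
    return posts[:target]


# Hypothetical stand-in for the API: 250 fake posts, served at most 100 at a time.
FAKE_POSTS = [{'title': f'post {i}', 'created_utc': 1609459200 + i} for i in range(1, 251)]

def fake_fetch_page(after, size=100):
    newer = [p for p in FAKE_POSTS if p['created_utc'] > after]
    return newer[:size]


sample = collect_posts(fake_fetch_page, start_after=1609459200)
print(len(sample))  # 250: the loop stops once the source runs dry
```

One caveat of this kind of cursor: if two posts share the same `created_utc` and the page boundary falls between them, a strictly-greater-than filter would skip the second one.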

In [3]:
# get dataframe for Learn Japanese
japanese = get_1000_posts('LearnJapanese')

In [4]:
# check dataframe
japanese

Unnamed: 0,subreddit,selftext,title,created_utc
0,LearnJapanese,\# Welcome to /r/learnjapanese!\n\n&amp;#x200B...,"WELCOME! Beginner Students, New /r/LearnJapane...",1609459220
1,LearnJapanese,[removed],Unique kanji studying technique using ASL (Ame...,1609461262
2,LearnJapanese,Anyone know if the Hobbit has been recorded in...,The Hobbit audiobook in Japanese?,1609466952
3,LearnJapanese,[removed],can anyone help me by typing an anime scene sc...,1609471961
4,LearnJapanese,[removed],"Now that you are fluent, how do you study?",1609474492
...,...,...,...,...
995,LearnJapanese,I have seen some variation in pronunciation. I...,Help me with pronunciation confusion.,1611469148
996,LearnJapanese,i found 2 one on anki website and one on this ...,wich one is the real 6k core deck,1611471874
997,LearnJapanese,I've been learning kanjis for almost a year no...,Kanji Writing Practice Books (in India),1611475374
998,LearnJapanese,"When we ask ""Nan nei sei desu ka."" Then we are...",I just started learning Japanese and i'm a lit...,1611475452


In [5]:
# get dataframe for korean
korean = get_1000_posts('korean')

In [6]:
# check dataframe
korean

Unnamed: 0,subreddit,selftext,title,created_utc
0,Korean,I am reading BillyGo book and stumbled across ...,How can I tell if a sentence is missing an opt...,1609462358
1,Korean,Legit as in if it's worth spending my money on...,Is KoreanClass101 legit?,1609466531
2,Korean,[removed],지다/더,1609469648
3,Korean,[removed],What is the best kind of content for all Korea...,1609474565
4,Korean,I CANNOT BELIEVE I did not think of this befor...,Big realization that I hope can be a help to e...,1609477688
...,...,...,...,...
995,Korean,So far I had only come across 지 being a negati...,Please confirm: real meaning of 지 and 마,1612209078
996,Korean,Is this an adverb and why is it different that...,Can someone explain 열심히 to me?,1612209219
997,Korean,I wanted to know how to write the sentences th...,Can someone help me to transcribe this sentence?,1612209440
998,Korean,"Hello. i know that the word ""it"" is not usuall...","""it"" in korean",1612211071


### Result
1000 posts successfully scraped for each subreddit.

## Data Cleaning

In [7]:
# check for empty cells and data types
print(" Japanese ".center(16, '='))
print(japanese.info())
print()
print(" Korean ".center(16, '='))
print(korean.info())

=== Japanese ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    1000 non-null   object
 1   selftext     990 non-null    object
 2   title        1000 non-null   object
 3   created_utc  1000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 31.4+ KB
None

==== Korean ====
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    1000 non-null   object
 1   selftext     982 non-null    object
 2   title        1000 non-null   object
 3   created_utc  1000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 31.4+ KB
None


Some 'selftext' values are null. This is not a problem, as those users likely wrote their entire post in the 'title'. We fill these cells with an empty string before combining 'title' and 'selftext' into a new feature, 'post'.

In [8]:
# create new feature 'post'

# replace NaN with empty string (assign back rather than filling in place on a column selection)
japanese['selftext'] = japanese['selftext'].fillna('')
korean['selftext'] = korean['selftext'].fillna('')

# concatenate 'title' with 'selftext'
japanese['post'] = japanese['title']+' '+japanese['selftext']
korean['post'] = korean['title']+' '+korean['selftext']
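To see why the null cells need filling before concatenation, consider a toy frame. This is a minimal sketch with made-up rows, not data from either subreddit: when 'selftext' is missing, adding it to the title produces a missing value, so the title would be lost too.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Is this sentence natural?', 'Best beginner textbook'],
    'selftext': [None, 'I am choosing between two books.'],
})

# Without imputation, the missing 'selftext' propagates and the title is lost.
naive = toy['title'] + ' ' + toy['selftext']
print(naive.isna().tolist())

# With imputation, every row keeps at least its title.
filled = toy['title'] + ' ' + toy['selftext'].fillna('')
print(filled.tolist())
```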

In [9]:
# check feature created
display(japanese.head())
display(korean.head())

Unnamed: 0,subreddit,selftext,title,created_utc,post
0,LearnJapanese,\# Welcome to /r/learnjapanese!\n\n&amp;#x200B...,"WELCOME! Beginner Students, New /r/LearnJapane...",1609459220,"WELCOME! Beginner Students, New /r/LearnJapane..."
1,LearnJapanese,[removed],Unique kanji studying technique using ASL (Ame...,1609461262,Unique kanji studying technique using ASL (Ame...
2,LearnJapanese,Anyone know if the Hobbit has been recorded in...,The Hobbit audiobook in Japanese?,1609466952,The Hobbit audiobook in Japanese? Anyone know ...
3,LearnJapanese,[removed],can anyone help me by typing an anime scene sc...,1609471961,can anyone help me by typing an anime scene sc...
4,LearnJapanese,[removed],"Now that you are fluent, how do you study?",1609474492,"Now that you are fluent, how do you study? [re..."


Unnamed: 0,subreddit,selftext,title,created_utc,post
0,Korean,I am reading BillyGo book and stumbled across ...,How can I tell if a sentence is missing an opt...,1609462358,How can I tell if a sentence is missing an opt...
1,Korean,Legit as in if it's worth spending my money on...,Is KoreanClass101 legit?,1609466531,Is KoreanClass101 legit? Legit as in if it's w...
2,Korean,[removed],지다/더,1609469648,지다/더 [removed]
3,Korean,[removed],What is the best kind of content for all Korea...,1609474565,What is the best kind of content for all Korea...
4,Korean,I CANNOT BELIEVE I did not think of this befor...,Big realization that I hope can be a help to e...,1609477688,Big realization that I hope can be a help to e...


In [10]:
# remove URLs and the '[removed]' placeholder
japanese['post'] = japanese['post'].map(lambda x: re.sub(r'\[removed\]', '', re.sub(r'http\S+', '', x)))
korean['post'] = korean['post'].map(lambda x: re.sub(r'\[removed\]', '', re.sub(r'http\S+', '', x)))

In [11]:
# remove numbers and punctuation
japanese['post'] = japanese['post'].map(lambda x: re.sub(r'[^\w\s]|\d', '', x))
korean['post'] = korean['post'].map(lambda x: re.sub(r'[^\w\s]|\d', '', x))

In [16]:
# change words to lowercase
japanese['post'] = japanese['post'].map(lambda x: x.lower())
korean['post'] = korean['post'].map(lambda x: x.lower())
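The three cleaning steps above can be combined into one function and tried on a single string. This is a sketch for illustration only: `clean_post` is a hypothetical helper name, and the first sample string is made up (the second is the raw text of Korean post 2).

```python
import re

def clean_post(text):
    text = re.sub(r'http\S+', '', text)       # drop URLs
    text = re.sub(r'\[removed\]', '', text)   # drop Reddit's '[removed]' placeholder
    text = re.sub(r'[^\w\s]|\d', '', text)    # drop punctuation and digits
    return text.lower()

english = clean_post('Is KoreanClass101 legit? See https://example.com [removed]')
hangul = clean_post('지다/더 [removed]')
print(english.split())
print(hangul.strip())
```

Because `\w` is Unicode-aware in Python 3, Hangul, kana and kanji survive the punctuation filter. It also matches the underscore, so underscores are kept.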

In [17]:
# check dataframes
display(japanese.head())
display(korean.head())

Unnamed: 0,subreddit,selftext,title,created_utc,post
0,LearnJapanese,\# Welcome to /r/learnjapanese!\n\n&amp;#x200B...,"WELCOME! Beginner Students, New /r/LearnJapane...",1609459220,welcome beginner students new rlearnjapanese u...
1,LearnJapanese,[removed],Unique kanji studying technique using ASL (Ame...,1609461262,unique kanji studying technique using asl amer...
2,LearnJapanese,Anyone know if the Hobbit has been recorded in...,The Hobbit audiobook in Japanese?,1609466952,the hobbit audiobook in japanese anyone know i...
3,LearnJapanese,[removed],can anyone help me by typing an anime scene sc...,1609471961,can anyone help me by typing an anime scene sc...
4,LearnJapanese,[removed],"Now that you are fluent, how do you study?",1609474492,now that you are fluent how do you study


Unnamed: 0,subreddit,selftext,title,created_utc,post
0,Korean,I am reading BillyGo book and stumbled across ...,How can I tell if a sentence is missing an opt...,1609462358,how can i tell if a sentence is missing an opt...
1,Korean,Legit as in if it's worth spending my money on...,Is KoreanClass101 legit?,1609466531,is koreanclass legit legit as in if its worth ...
2,Korean,[removed],지다/더,1609469648,지다더
3,Korean,[removed],What is the best kind of content for all Korea...,1609474565,what is the best kind of content for all korea...
4,Korean,I CANNOT BELIEVE I did not think of this befor...,Big realization that I hope can be a help to e...,1609477688,big realization that i hope can be a help to e...


In [18]:
# drop duplicates
japanese.drop_duplicates(subset='post', inplace=True)
korean.drop_duplicates(subset='post', inplace=True)
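A small illustration of what `drop_duplicates` does here, using made-up rows: it keeps the first occurrence of each repeated 'post' and leaves the original index labels in place, so the index has gaps afterwards unless it is reset.

```python
import pandas as pd

toy = pd.DataFrame({'post': ['how do i study kanji',
                             'how do i study kanji',
                             'is this book good']})

deduped = toy.drop_duplicates(subset='post')
print(deduped.index.tolist())  # [0, 2]: row 1 was the repeat
```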

In [19]:
# check number of rows remaining
print(f"Japanese: {japanese.shape}")
print(f"Korean: {korean.shape}")

Japanese: (995, 5)
Korean: (976, 5)


The Japanese dataframe had 5 posts removed as duplicates and the Korean dataframe had 24, which is only about 2.4% of its posts. There is sufficient data to proceed.
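The remaining row counts also appear to account for the 50.5% baseline accuracy quoted in the introduction. The check below assumes the baseline is the share of the larger class, that is, the accuracy of always predicting Japanese. That definition is an assumption and is not stated in this notebook.

```python
n_japanese = 995   # rows left after dropping duplicates
n_korean = 976

# Accuracy of always predicting the larger class
baseline = max(n_japanese, n_korean) / (n_japanese + n_korean)
print(f"{baseline:.1%}")  # 50.5%
```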

In [20]:
# save 'post' to csv
japanese[['post']].to_csv('datasets/japanese.csv', index=False)
korean[['post']].to_csv('datasets/korean.csv', index=False)

We move on to the next Python notebook, (2) Data Analysis.