<div style="text-align: right"> Christopher Hyek </div>
<div style="text-align: right"> 9/25/2019 </div>
<h1 align="center"> Capstone Project Part 2</h1>
<h2 align="center"> Japanese NLTK Support </h2>

# Index

[**Abstract**](#Abstract)

- [**Currently Completed Goals**](#Abstract)

- [**Pending Goals**](#Abstract)

- [**Where is the Web Scraper?**](#Abstract)

[**Imports**](#Imports)

[**What is NLTK?**](#1.0)

- [**Word Tokenization**](#1.1)

- [**Stop words**](#1.2)

- [**Stemming**](#1.3)

- [**Lemmatization**](#1.4)

- [**Conjugation checker**](#1.5)

- [**Named Entity Recognition**](#1.6)

- [**Chunking and Stop Words pt 2**](#1.7)

- [**Parts of Speech tagging**](#1.8)

[**Future Project Expansion**](#2.0)

[**Final Thoughts**](#3.0)

# Abstract
This program addresses the parts of the NLTK library that do not work well with Japanese text, and turns Japanese text into data that the library can use easily. This is a fairly complex undertaking: NLTK is not particularly simple, so I have broken the work into segments that are assessed and worked on one at a time throughout the project.

This program also tries to make sure that the reader/user does not need to be familiar with Japanese beyond the short excerpts that I provide throughout the Jupyter Notebook. While I would always recommend familiarizing yourself with the language to some extent ahead of time, making this as accessible as possible is one of my end goals.

Below I list what the project has currently completed to some degree and what the working goals are for the future. While I have a working understanding of the language through nearly a decade of study, I am by no means a linguist or an expert translator, nor am I fluent in Japanese. If you find an error or a better way of doing something shown below, then by all means submit a change request, and I hope that we both come out learning more from it.

#### Currently Completed Goals
    - Pre-language processing
    - Basic sentence processing
    - Basic Stemming / Lemmatization
    - Particle recognition
    - Word/Character lookup and reverse lookup
    - Basic Parts of Speech Tagging
    - Basic Tokenization

#### Pending Goals
    - Including more particles
    - Including complex Stemming / Lemmatization
    - Complex sentence processing
    - Stop Word inclusion
    - NER/NED inclusion
    - Better Parts of Speech Tagging
    
#### Where is the Web Scraper used? 
This segment of code is saved as another .py file found within this repository. I purposely kept it separate because this program focuses on using the data, while the web scraper file is primarily for cleaning the data used for this project. I highly recommend checking it out, and on top of that checking out [Jisho.org](https://jisho.org/), which is where (as of the most current version) I obtained all of my Japanese data.

## Imports

In [1]:
import nltk
import pandas as pd
import panel as pn

from panel.interact import interact, interactive, fixed, interact_manual
from panel import widgets

# 1.0

# What is NLTK?
NLTK (the Natural Language Toolkit) is a Python platform for working with human language data and building natural language processing (NLP) programs, including statistical ones.

## Parts of NLTK brought up
There are many parts of NLTK but the ones that I would like to focus on for the time being are:

    - Stemming
    - Lemmatization
    - Tokenization
    - Stop Words 
    - Parts of Speech Classification
    - Named Entity Recognition
    
In each section below I will go into detail as to what these mean relative to my program and why they are important to the overall goal of becoming a supplement for the NLTK library.

## 1.1
## Word Tokenization pt 1
Before we jump directly into this, we have to establish one of the more difficult aspects of Japanese and why it isn't currently supported by NLTK. Japanese does not put spaces between words as many other languages do, so in order to split sentences into processable words we must first find out <i>how</i> to split them. This step is closest to what would be considered tokenizing a sentence into words.

### Step 1 - Splitting by Script
A very blunt but effective way to initially split the characters in a sentence is to split them by the three scripts that Japanese uses, as well as by non-Japanese characters and punctuation. We split this way as our initial step because relatively few words use multiple scripts at the same time.

<b> Disclaimer! </b>
Before we split hairs on what I'm referring to as scripts, let me explain what I mean by this. Japanese has three primary scripts that create the Japanese writing system. 
    
    - Hiragana  (ex: こんにちは - Hello)      
    - Katakana  (ex: コンピューター - Computer)  
    - Kanji     (ex: 科学 - Science)            
    
Hiragana and Katakana are technically syllabaries (representations of the phonetic pronunciation of the language), whereas Kanji is comprised of logographic characters (they usually represent a word or phrase, although some are only used in conjunction with others to make a word or phrase). The characters that many people consider hard to understand in Japanese are the Kanji, whereas Hiragana and Katakana are generally less difficult, and there are about 100 of each compared to the thousands of Kanji. These three systems of writing are what I'll be describing as scripts from this point forward for simplicity's sake.

#### How and why do we split the script?
 

If a word uses two or more scripts, it is likely to be a loan-word, verb, adverb, or adjective. That may seem like a lot of words, but the largest group of words found within the data is nouns, and they very rarely have more than one script, so this type of split is helpful for discerning what is already the largest group of words.

### Script Splitting Function
Python 3 strings are Unicode, so the built-in ord() function returns each character's code point as an integer (code points are usually written in hexadecimal). Thankfully, the three Japanese scripts are clustered into contiguous code point ranges by script type, which makes categorization much easier.

The following four functions check a character's numerical value (its code point, converted from hexadecimal to decimal) to see whether the character we are looking at falls into one of these categories.

In [2]:
def check_hiragana(number2):
    if (number2) >= 12353 and (number2) < 12439:
        return True

    return False

def check_katakana(number2):
    if (number2) >= 12449 and (number2) < 12541:
        return True
    
    return False

def check_kanji(number2):
    if (number2) >= 19968 and (number2) < 40880:
        return True
    
    return False

def check_j_punctuation(number2):
    if (number2) >= 12288 and (number2) < 12351:
        return True
    
    return False
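
The decimal bounds in these functions are Unicode code points, and writing them in hexadecimal makes the link to the Unicode blocks easier to see. The following is only a minimal sketch and not part of the original program: `SCRIPT_RANGES` and `script_of` are hypothetical names, and the bounds are simply copied from the four check functions.

```python
# Hypothetical helper: the same bounds as the check_* functions, written in hex.
SCRIPT_RANGES = {
    'hiragana':    (0x3041, 0x3097),   # 12353 <= n < 12439
    'katakana':    (0x30A1, 0x30FD),   # 12449 <= n < 12541
    'kanji':       (0x4E00, 0x9FB0),   # 19968 <= n < 40880
    'punctuation': (0x3000, 0x303F),   # 12288 <= n < 12351
}

def script_of(char):
    """Return the script label for a single character, or 'other'."""
    code_point = ord(char)
    for label, (start, stop) in SCRIPT_RANGES.items():
        if start <= code_point < stop:
            return label
    return 'other'

# Print each character with its hexadecimal code point and script label.
for ch in 'こコ科、a':
    print(ch, hex(ord(ch)), script_of(ch))
```

Keeping the ranges in one table means a bound only has to be corrected in one place if it turns out to be too narrow.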

In [3]:
class Jpn_splitter:

# Scripter_lite is meant to take the string that you provided it and to break it up into the following portions: Hiragana, 
# Katakana, Kanji, Punctuation, or otherwise. In an effort to not repeat itself most of the interior functions used in it
# are left above.
    
    def scripter_lite():
        test = input("Please insert a Japanese sentence: ")
        release = []
        for letter in range(len(test)):
            number = ord(test[letter])
            try:
                number2 = ord(test[letter +1])

        #         Hiragana If Statement
                if (number) >= 12353 and (number) < 12439:                  
                    if check_hiragana(number2):
                        release.append(test[letter])
                    else:
                        release.append(test[letter])
                        release.append(' ')                      
     
        #         Katakana If Statement
                elif (number) >= 12449 and (number) < 12541:                
                    if check_katakana(number2):
                        release.append(test[letter])
                    else:
                        release.append(test[letter])
                        release.append(' ')        

        #         Kanji If Statement     
                elif (number) >= 19968 and (number) < 40880:    
                    if check_kanji(number2):
                        release.append(test[letter])
                    else:
                        release.append(test[letter])
                        release.append(' ')       

        #        Punctuation If Statement              
                elif (number) >= 12288 and (number) < 12351:       
                    if check_j_punctuation(number2):
                        release.append(test[letter])
                    else:
                        release.append(test[letter])
                        release.append(' ') 

                else:
                    print('Program received non-Japanese Text')
            except IndexError:
                # Last character: there is no next character to compare against.
                release.append(test[letter])
        return release    
    
    
# Scripter_extended is Scripter_lite but prints out a lot of statements regarding each step along the process. I would recommend
# using it at first to familiarize yourself with Japanese or to bug test issues, and then moving to Scripter_lite since 
# it is faster and has fewer parts within it.

    def scripter_extended():
        test = input("Please insert a Japanese sentence: ")
        release = []
        for letter in range(len(test)):
            hexa = hex(ord(test[letter]))
            number = ord(test[letter])
            try:
                letter2 = letter + 1
                hexa2 = hex(ord(test[letter +1]))
                number2 = ord(test[letter +1])

        #         Hiragana If Statement
                if (number) >= 12353 and (number) < 12439:
                    print('Grouping {}:'.format(letter))
                    print('Is a Hiragana')
                    print(test[letter], hexa, number)
                    
                    if check_hiragana(number2):
                        release.append(test[letter])
                        print('Is also Hiragana')
                        print(test[letter + 1], hexa2, number2, '\n')
                    else:
                        release.append(test[letter])
                        release.append(' ')
                        print('Is not Hiragana')
                        print(test[letter + 1], hexa2, number2, '\n')                        

                        
        #         Katakana If Statement
                elif (number) >= 12449 and (number) < 12541:
                    print('Grouping {}:'.format(letter))
                    print('Is a Katakana')
                    print(test[letter], hexa, number)        
                    
                    if check_katakana(number2):
                        release.append(test[letter])
                        print('Is also Katakana')
                        print(test[letter + 1], hexa2, number2, '\n')
                    else:
                        release.append(test[letter])
                        release.append(' ')
                        print('Is not Katakana')
                        print(test[letter + 1], hexa2, number2, '\n')  
        

        #         Kanji If Statement     
                elif (number) >= 19968 and (number) < 40880:
                    print('Grouping {}:'.format(letter))
                    print('Is a Kanji')
                    print(test[letter], hexa, number)        
        
                    if check_kanji(number2):
                        release.append(test[letter])
                        print('Is also Kanji')
                        print(test[letter + 1], hexa2, number2, '\n')
                    else:
                        release.append(test[letter])
                        release.append(' ')
                        print('Is not Kanji')
                        print(test[letter + 1], hexa2, number2, '\n')        
        

        #        Punctuation If Statement              
                elif (number) >= 12288 and (number) < 12351:
                    print('Grouping {}:'.format(letter))
                    print('Is a Japanese Punctuation')
                    print(test[letter], hexa, number)        
        
                    if check_j_punctuation(number2):
                        release.append(test[letter])
                        print('Is also Japanese Punctuation')
                        print(test[letter + 1], hexa2, number2, '\n')
                    else:
                        release.append(test[letter])
                        release.append(' ')
                        print('Is not Japanese Punctuation')
                        print(test[letter + 1], hexa2, number2, '\n') 

                else:
                    print('Program received non-Japanese Text')
            except IndexError:
                # Last character: there is no next character to compare against.
                release.append(test[letter])
                print(test[letter], number, hexa)
        return release

# This is the first method added that does not take a string and put in rudimentary spaces to the string. What the following
# method does is take that string and make it into a list based on the spaces as the cuts between the values put into it. 
# There is an option to include print statements but due to the size of the code I just commented them out instead of making
# another method within the class.


    def create_list(splitter):
        splitter = ''.join(splitter)
#         print(type(splitter))
#         print(splitter.split())
        return splitter.split()

In [4]:
def Jpn_list_creator():
    jpn = Jpn_splitter
    
    first_part = jpn.scripter_lite()
    second_part = jpn.create_list(first_part)
    return second_part
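
For experimenting without typing into an input prompt, the same idea can be sketched as a function that takes the sentence as an argument. This is an alternative sketch, not the class above: `script_label` and `split_by_script` are hypothetical names, and the grouping is done with `itertools.groupby` from the standard library. One assumed difference is that characters outside the four ranges are kept as their own group instead of being dropped, so edge cases may not match `scripter_lite` exactly.

```python
from itertools import groupby

def script_label(char):
    """Classify one character with the same bounds as the check_* functions."""
    n = ord(char)
    if 12353 <= n < 12439:
        return 'hiragana'
    if 12449 <= n < 12541:
        return 'katakana'
    if 19968 <= n < 40880:
        return 'kanji'
    if 12288 <= n < 12351:
        return 'punctuation'
    return 'other'

def split_by_script(sentence):
    """Group consecutive characters that share a script into one token."""
    return [''.join(chars) for _, chars in groupby(sentence, key=script_label)]

print(split_by_script('今、私はコンピューター科学を勉強しているよ'))
```

Because the sentence is passed in directly, the test sentences in the next cell can be fed to it without copying and pasting.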

In [5]:
# For those that are interested in playing with the methods, here are a few sentences that you can practice with along with 
# their translations. 

test = '昨日は悪かった、でも今日はいいですよ。'        # "Yesterday I felt bad, but today I feel good."
test2 = '今、私はコンピューター科学を勉強しているよ。'  # "Right now I am studying Computer Science."
test3 = '私は火曜日に大学へ行きますね'                # "On Tuesday I go to college."

#### Disclaimer!!!!
Before moving on to where the splitter works on these sentences and where it doesn't, there is one major feature of Japanese that English doesn't have: particles. Particles are used within a sentence to denote the subject, object, location, time, and many other things. There are MANY particles, but the primary takeaway is that, for our purposes, particles are essentially the stop words for NLTK.

In [6]:
# Test 1
Jpn_list_creator()

Please insert a Japanese sentence: 昨日は悪かった、でも今日はいいですよ


['昨日', 'は', '悪', 'かった', '、', 'でも', '今日', 'はいいですよ']

This sentence is almost where it needs to be to then be converted over to English. The third and fourth values comprise one word, but it is a dual-script word ('悪', 'かった' should be '悪かった'), and the final value in the list is actually several particles attached to an adjective and the copula ('はいいですよ' should be 'は', 'いい', 'です', 'よ'). But even with this limited program we have a basic sentence pretty much tokenized already.

In [7]:
# Test 2
Jpn_list_creator()

Please insert a Japanese sentence: 今、私はコンピューター科学を勉強しているよ


['今', '、', '私', 'は', 'コンピューター', '科学', 'を', '勉強', 'しているよ']

This sentence is even more accurate than the first one: the only incorrect value is the last one, which has a particle connected to a verb ('しているよ' should be 'している', 'よ'). This one has fewer mistakes because it uses a noun-verb combination, so splitting by script yields the noun ('勉強') and the verb rather than a broken word.

In [8]:
# Test 3
Jpn_list_creator()

Please insert a Japanese sentence: 私は火曜日に大学へ行きますね


['私', 'は', '火曜日', 'に', '大学', 'へ', '行', 'きますね']

This sentence suffers from the same kinds of problems as the earlier sentences: the verb is split incorrectly and a particle is attached to the end of the verb ('行', 'きますね' should be '行きます', 'ね').

### Result of Tests
The sentences seemed to be split pretty well despite the same primary issues coming up, which are words that contain multiple scripts and particles. The first thing we should probably work on is splitting the particles from the other words so that we can read the words within the sentence more easily.

# 1.2
# Particles, the Japanese Stop Words
There are a few facts about particles that you should know before we continue. They are almost always written in Hiragana; some have Kanji versions, but those are archaic and rarely used. There are 70 listed on the Jisho.org website, but we only use nine of the most common ones within the program as of right now.

#### Another Reminder
What is a particle?

A particle in Japanese marks the role that a part of the sentence plays. Without any need for context, it indicates what the subject of the sentence is, where or when something is occurring, or which direction something is going. Particles are not things you would find in English, but they help make Japanese easier to read.

The ones that are used in our functions below are: 

は, が, に, で, へ, や, よ, ね, と

#### Used to test the particle checker
The test below uses the sentence:

'昨日は悪かったが、今日はともだちと日本語で話して、よくになりますよ。'

This sentence means: Yesterday was not good, but today I spoke Japanese with a friend and felt better.

I chose this sentence because it allows for several tricky particle checks at the beginning and end of the following values:

はともだちと  ---- Starts with a particle and ends with a particle ---- should be 'は', 'ともだち', 'と'

よくになりますよ - Ends with a particle but does not start with one, has a middle particle --- should be 'よく', 'に', 'なります', 'よ'

In [9]:
# copy and paste this sentence into the next cells input

test4 = '昨日は悪かったが、今日はともだちと日本語で話して、よくになりますよ。'

In [10]:
final_test4 = Jpn_splitter.scripter_lite()
ft4 = Jpn_splitter.create_list(final_test4)
print(ft4)

Please insert a Japanese sentence: 昨日は悪かったが、今日はともだちと日本語で話して、よくになりますよ。
['昨日', 'は', '悪', 'かったが', '、', '今日', 'はともだちと', '日本語', 'で', '話', 'して', '、', 'よくになりますよ', '。']


### Results
The output of the initial splitting program clearly shows the problems that I listed above.

## Functions that check for particles
The following functions do a very light check on the output of the create_list function we created previously. They check the beginning or end of each value in the list for a particle, and remove any empty strings that the splitting creates.

In [11]:
def particle_check(lister, particle):
    new_list = []
    for w in lister:
        if w[:1] == particle:
            new_list.append(w[:1])
            new_list.append(w[1:])
        else:
            new_list.append(w)
    return new_list

def end_particle_check(lister, particle):
    new_list = []
    for w in lister:
        if w[-1:] == particle:
            new_list.append(w[:-1])
            new_list.append(w[-1:])
        else:
            new_list.append(w)
    return new_list 

def clean_out_blanks(lister):
    new_list = []
    for i in lister:
        if i == '':
            pass
        else:
            new_list.append(i)
    return new_list

Why does the cell below seem to break the "don't repeat yourself" mentality? 

It partially does, because I wanted each step to be explicit. Each of these checks takes the list, checks or changes part of it, and then passes the result on to the next particle's check. A loop would only work if it reassigned the list on every pass; a loop that kept checking the same initial list (marked as 'test') would discard the changes made for the previous particles. Writing the chain out makes sure that each check receives the revised list.

In [12]:
def particle_gauntlet(test):
    ha = particle_check(test, 'は')
    ga = particle_check(ha, 'が')
    ni = particle_check(ga, 'に')
    de = particle_check(ni, 'で')
    he = particle_check(de, 'へ')
    ya = particle_check(he, 'や')
    yo = end_particle_check(ya, 'よ')
    ne = end_particle_check(yo, 'ね')
    ga2 = end_particle_check(ne, 'が')
    to = end_particle_check(ga2, 'と')
    
    result = clean_out_blanks(to)
    
    return result 
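
For comparison, the chain can also be written as a loop that reassigns the list on each pass. This is a sketch under stated assumptions rather than a replacement for the cell above: `split_leading`, `split_trailing`, `LEADING`, `TRAILING`, and `particle_gauntlet_loop` are hypothetical names, and the two helpers restate the particle check functions so that the block runs on its own.

```python
def split_leading(tokens, particle):
    """Split `particle` off the front of every token that starts with it."""
    out = []
    for tok in tokens:
        if tok[:1] == particle:
            out.extend([tok[:1], tok[1:]])
        else:
            out.append(tok)
    return out

def split_trailing(tokens, particle):
    """Split `particle` off the end of every token that ends with it."""
    out = []
    for tok in tokens:
        if tok[-1:] == particle:
            out.extend([tok[:-1], tok[-1:]])
        else:
            out.append(tok)
    return out

# Same particles, in the same order, as the explicit chain.
LEADING = ['は', 'が', 'に', 'で', 'へ', 'や']
TRAILING = ['よ', 'ね', 'が', 'と']

def particle_gauntlet_loop(tokens):
    for particle in LEADING:
        tokens = split_leading(tokens, particle)    # reassign: next pass sees the revised list
    for particle in TRAILING:
        tokens = split_trailing(tokens, particle)
    return [tok for tok in tokens if tok != '']     # drop the blanks left by the splits

tokens = ['昨日', 'は', '悪', 'かった', '、', 'でも', '今日', 'はいいですよ', '。']
print(particle_gauntlet_loop(tokens))
```

The reassignment inside each loop is what carries the revised list forward, and adding a particle then only means extending one of the two lists.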

Let's take a few of the test variables we had above and re-run them through our new functions to see what comes out from them relative to what they were before.

In [13]:
# use this for the input: 昨日は悪かった、でも今日はいいですよ。

new_test = Jpn_splitter.scripter_lite()
nt1 = Jpn_splitter.create_list(new_test)
nt1

Please insert a Japanese sentence: 昨日は悪かった、でも今日はいいですよ。


['昨日', 'は', '悪', 'かった', '、', 'でも', '今日', 'はいいですよ', '。']

In [14]:
particle_gauntlet(nt1)

['昨日', 'は', '悪', 'かった', '、', 'で', 'も', '今日', 'は', 'いいです', 'よ', '。']

While the particle checking function did its job and split the particles from the words found within the sentence, it did a bit too good of a job: it split 'でも', which is a single word, into two separate values ('で', 'も'). This kind of mistake is more of an oversight than a problem, and it'll be fixed in future renditions. The only other problems that this sentence now has are that the adjective ('悪', 'かった') should still be one value and that the adjective-copula combination ('いいです') should be split in half ('いい', 'です'). These are the two problems that were recognized early on, so we have one less problem than we started with.
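
One possible way to avoid over-splitting words such as 'でも' is to keep a small set of values that must never be split. The sketch below is an assumption about how a future rendition could handle this, not the project's actual fix: `PROTECTED` and `split_leading_particle` are hypothetical names, and only values that match a protected word exactly are left alone.

```python
# Hypothetical set of multi-character words that must stay whole.
PROTECTED = {'でも'}

def split_leading_particle(tokens, particle, protected=PROTECTED):
    """Split a leading particle off each token, skipping protected tokens."""
    out = []
    for tok in tokens:
        if tok in protected or tok == particle:
            out.append(tok)                              # leave whole
        elif tok.startswith(particle):
            out.extend([particle, tok[len(particle):]])  # particle, then the rest
        else:
            out.append(tok)
    return out

tokens = ['、', 'でも', '今日', 'はいいですよ']
print(split_leading_particle(tokens, 'で'))   # 'でも' is left untouched
print(split_leading_particle(tokens, 'は'))   # 'は' is still split off
```

This does nothing for a protected word that is still glued to other Hiragana, so it would only help once the surrounding values have been separated.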

In [15]:
# use this for the input: 今、私はコンピューター科学を勉強しているよ

new_test2 = Jpn_splitter.scripter_lite()
nt2 = Jpn_splitter.create_list(new_test2)
print(nt2)

Please insert a Japanese sentence: 今、私はコンピューター科学を勉強しているよ
['今', '、', '私', 'は', 'コンピューター', '科学', 'を', '勉強', 'しているよ']


In [16]:
a = particle_gauntlet(nt2)
print(a)

['今', '、', '私', 'は', 'コンピューター', '科学', 'を', '勉強', 'している', 'よ']


While the difference between the two cells above is minute, the particles are now completely separated from each of the words. The only things left to do with this example would be to apply stemming or lemmatization, remove the stop words, and then convert it all to English.

In [17]:
# use this for the input: 私は火曜日に大学へ行きますね

new_test3 = Jpn_splitter.scripter_lite()
nt3 = Jpn_splitter.create_list(new_test3)
print(nt3)

Please insert a Japanese sentence: 私は火曜日に大学へ行きますね
['私', 'は', '火曜日', 'に', '大学', 'へ', '行', 'きますね']


In [18]:
a = particle_gauntlet(nt3)
print(a)

['私', 'は', '火曜日', 'に', '大学', 'へ', '行', 'きます', 'ね']


Same as the second example, this one has the particles split from each of the words and it is ready for stemming/lemmatization followed by stop word removal and then translation.

## Final Results
With a few small functions, we managed to create a way to take the particles found within Japanese and split them from the other words in the sentence. These functions don't remove the stop words just yet, but let's move on to the hardest part of the project; we'll revisit stop words later on.
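
Once the particles are separate values, removing them can be sketched as a simple filter. This is only an illustration of the idea and not the stop word handling planned for the project: `PARTICLE_STOP_WORDS` and `remove_particles` are hypothetical names, and the set is an assumed sample of single-character particles.

```python
# Assumed sample of particles to treat as stop words.
PARTICLE_STOP_WORDS = {'は', 'が', 'に', 'で', 'へ', 'を', 'と', 'や', 'よ', 'ね'}

def remove_particles(tokens, stop_words=PARTICLE_STOP_WORDS):
    """Return the tokens that are not in the stop word set, in order."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ['私', 'は', '火曜日', 'に', '大学', 'へ', '行', 'きます', 'ね']
print(remove_particles(tokens))
```

The filter only works on values that are exactly a particle, which is why the splitting step has to come first.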

# 1.3
# Stemming
While this is arguably the hardest section of the NLTK library to work on in Japanese, we'll discuss what the scope is and what we are trying to achieve with it.

Japanese verbs can be broken up into a few different categories. There are about 15 categories of verb conjugation, and each one holds over 20 different endings for a verb. So without even providing an exhaustive list, we have over 300 combinations of conjugations to account for before we move towards a halfway exhaustive list.

To keep things semi realistic we will be looking over only 20 of the conjugations and 14 verb types.


## How will these conjugations come into play?
I created a dictionary of dictionaries to hold all of the conjugation endings, and with a couple of functions we'll be able to combine the stem with these endings as we check words to verify that a given value is a specific word.
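
To make the idea concrete, here is a heavily reduced sketch of what such a dictionary of dictionaries could look like. It is not the project's actual table: `CONJUGATION_ENDINGS`, `conjugate`, and `matches_verb` are hypothetical names, the 'ichidan' tag is an assumed label, and only five forms for two verb types are included.

```python
# Hypothetical, reduced table: verb tag -> {form name: ending added to the stem}.
CONJUGATION_ENDINGS = {
    'ichidan': {'dictionary': 'る', 'polite': 'ます', 'negative': 'ない',
                'past': 'た', 'te': 'て'},
    'mu':      {'dictionary': 'む', 'polite': 'みます', 'negative': 'まない',
                'past': 'んだ', 'te': 'んで'},
}

def conjugate(stem, verb_tag):
    """Combine a stem with every ending listed for its verb tag."""
    endings = CONJUGATION_ENDINGS[verb_tag]
    return {form: stem + ending for form, ending in endings.items()}

def matches_verb(word, stem, verb_tag):
    """Check whether `word` is one of the known conjugations of the verb."""
    return word in conjugate(stem, verb_tag).values()

print(conjugate('食べ', 'ichidan'))
print(matches_verb('飲みます', '飲', 'mu'))
```

The outer key selects the verb type and the inner dictionary lists its endings, so supporting another verb type means adding one more inner dictionary.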

In [19]:
cd

C:\Users\Atlas


In [20]:
cd Module 5 Capstone Project

C:\Users\Atlas\Module 5 Capstone Project


In [21]:
cd Verb Tag Files v2

C:\Users\Atlas\Module 5 Capstone Project\Verb Tag Files v2


In [22]:
df = pd.read_csv('jisho_dict_v2.csv')
df = df.drop('Unnamed: 0', axis = 1)
df.head()

Unnamed: 0,word,pronunciation,common tag,jlpt tag,meanings_wrapper,details_href,verb_tag
0,学校,がっこう,Common word,JLPT N5,"{'Noun': 'school', 'Place': 'Gakkou', 'Wikiped...",https://jisho.org/word/学校,Not Verb
1,川,かわ,Common word,JLPT N5,"{'Noun': 'river; stream', 'Suffix': 'River; th...",https://jisho.org/word/川,Not Verb
2,手,て,Common word,JLPT N5,"{'Noun': 'hand; arm', 'Noun, Noun - used as a ...",https://jisho.org/word/手,Not Verb
3,戸,と,Common word,JLPT N5,"{'Noun': 'door (esp. Japanese-style)', 'Place'...",https://jisho.org/word/戸,Not Verb
4,眼鏡,めがね,Common word,JLPT N5,"{'Noun': 'glasses; eyeglasses; spectacles', 'P...",https://jisho.org/word/眼鏡,Not Verb


## Creating the Stem Column
Stemming and Lemmatization differ in how a word is reduced. The stem of a Japanese verb is the part that doesn't change due to conjugation, which here is taken to be everything but the last character of the dictionary form. Lemmatization instead converts the word to what is known as its dictionary form.

#### For reference
The above holds for most verb conjugation groups EXCEPT 'suru' verbs, which are excluded below and given their own tag. They are excluded primarily because they are a Noun + Verb combination meaning "to do Noun as an action". The suru verb words listed above are therefore just nouns with no conjugating ending of their own; the 'to do' verb is what conjugates.

In [23]:
verb_list = df.loc[(df.verb_tag != 'Not Verb') & (df.verb_tag != 'suru'), 'word']

In [24]:
new_row = []
for row in verb_list:
    print(row[0:-1])
    new_row.append(row[0:-1])

上げ
走
飲
成
生き
流れ
有
言
出来
見
知
持
話
買
読
出
取
使
待
作
行
乗
終わ
寝
泳
立
着
呼
掛け
歌
上
頼
借り
飛
売
休
降り
止ま
洗
切
返
押
勤め
張
渡
浴び
撮
締め
居
会
分か
聞
書
入
置
住
食べ
歩
働
着
教え
降
死
帰
忘れ
出かけ
掛か
起き
座
入れ
疲れ
開け
見せ
違
付け
覚え
困
生まれ
始ま
貸
弾
遊
無く
渡
消
閉め
晴れ
曲が
消え
吹
引
習
脱
並
吸
閉ま
咲
並べ
曇
磨
鳴
差
答え
要
煮
茹で
思
見え
考え
始め
受け
探
急
選
送
怒
払
合
決め
笑
喜
思い出
落ち
楽し
間に合
訪ね
進
向か
通
起こ
運
盗
残
上が
致
逃げ
役に立
行
見つか
打
増え
迎え
集め
集ま
比べ
落と
壊れ
倒れ
壊
下が
決ま
解け
熟
跳び上が
摘
埋め
擦
退
せが
凝
見とれ
彷徨
萎れ
縺れ
惚れ
瞬
決め付け
選り分け
事によ
燥
捥
窶れ
呆け
選
引きつ
くたば
灯
であ
為れ
為せ
事があ
とな
事にな
たが
上手くい
風邪をひ
ちゃ
手に入れ
年をと
に依
でもあ
身につけ
目を通
目が覚め
手に入
と
責任を持
顔を出
ばれ
様にな
によって異な
ばら
絵を描
擤
顔を潰
突っかけ
破れ
千切
言付け
来
乗り越え
語ら
潜
握ら
伸び悩
先駆け
懐かし
押し切
低め
突き上げ
表立
分か
報われ
位置付け
撃ち止め
彩
追っかけ
行き過ぎ
揺るが
見入
怒鳴り込
洗い上げ
閊え
仄めか
極め
反
上せ
逆上せ
食い下が
戒め
憂え
蒸か
寝返
引き締ま
力
立て替え
看
追い抜
擽
群れ
奮
話し込
暈け
執
食み出
見届け
薄ま
謳
興
惚け
思い上が
似せ
煮込
丸ま
愛
し続け
絞り込
打ち付け
重
競
預け入れ
印象付け
抑え
荒らげ
逸れ
響め
畝
準え
包
め
戯れ
省み
ことが出来
生まれ育
そそられ
一味違
並び替え
大きくな
大人にな
別れ
申し上げ
引っ越
下げ
乗り換え
通
騒
踏
冷え
飾
暮れ
申
写
乱れ
沸か
漬け
呉れ
過ぎ
止め
遅れ
続け
手伝
見つけ
驚
開
眠
勝
聞こえ
尋ね
慣れ
泣
似
戻
調べ
伝え
無くな
動
知らせ
続
鳴
負け
亡くな
治
回
間違え
捨て
育て
褒め
塗
泊ま
片付け
止

In [25]:
new_dict = zip(verb_list, new_row)

In [26]:
stem_df = pd.DataFrame(new_dict)

In [27]:
stem_df.head()

Unnamed: 0,0,1
0,上げる,上げ
1,走る,走
2,飲む,飲
3,成る,成
4,生きる,生き


In [28]:
stem_df.columns = ['word', 'stem']

In [29]:
stem_df.head()

Unnamed: 0,word,stem
0,上げる,上げ
1,走る,走
2,飲む,飲
3,成る,成
4,生きる,生き


In [30]:
merged_df = pd.merge(df, stem_df, how='left', on='word')
merged_df['stem'] = merged_df.stem.fillna('No Stem')
merged_df.loc[(merged_df.verb_tag == 'suru'), 'stem'] = 'suru verb'

In [31]:
# The following head is this large to show each of the changes done to the dataframe

merged_df.head(25)

Unnamed: 0,word,pronunciation,common tag,jlpt tag,meanings_wrapper,details_href,verb_tag,stem
0,学校,がっこう,Common word,JLPT N5,"{'Noun': 'school', 'Place': 'Gakkou', 'Wikiped...",https://jisho.org/word/学校,Not Verb,No Stem
1,川,かわ,Common word,JLPT N5,"{'Noun': 'river; stream', 'Suffix': 'River; th...",https://jisho.org/word/川,Not Verb,No Stem
2,手,て,Common word,JLPT N5,"{'Noun': 'hand; arm', 'Noun, Noun - used as a ...",https://jisho.org/word/手,Not Verb,No Stem
3,戸,と,Common word,JLPT N5,"{'Noun': 'door (esp. Japanese-style)', 'Place'...",https://jisho.org/word/戸,Not Verb,No Stem
4,眼鏡,めがね,Common word,JLPT N5,"{'Noun': 'glasses; eyeglasses; spectacles', 'P...",https://jisho.org/word/眼鏡,Not Verb,No Stem
5,煙草,たばこ,Common word,JLPT N5,"{'Noun': 'tobacco; cigarette; cigaret; cigar',...",https://jisho.org/word/煙草,Not Verb,No Stem
6,赤,あか,Common word,JLPT N5,"{'Noun': 'Red (i.e. communist)', 'No-adjective...",https://jisho.org/word/赤,Not Verb,No Stem
7,仕事,しごと,Common word,JLPT N5,"{'Noun, Suru verb, No-adjective': 'work; job; ...",https://jisho.org/word/仕事,suru,suru verb
8,英語,えいご,Common word,JLPT N5,"{'Noun, No-adjective': 'English (language)', '...",https://jisho.org/word/英語,Not Verb,No Stem
9,問題,もんだい,Common word,JLPT N5,"{'Noun': 'question (e.g. on a test); problem',...",https://jisho.org/word/問題,Not Verb,No Stem


### Final Results
We now have a column denoting the verb type as well as a column exclusively for the stem of each verb. The only thing left to work on is returning the stem whenever the word in the column, or any of its conjugations, is found; then we'll have a successful Stemming method.
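
One rough way to return a stem for a conjugated word is a longest-prefix match against the known stems. This is a sketch under stated assumptions and not the finished Stemming method: `KNOWN_STEMS` and `find_stem` are hypothetical names, and the list is a small sample of stems taken from the output above.

```python
# Small sample of stems copied from the stem column.
KNOWN_STEMS = ['上げ', '上', '食べ', '飲', '行']

def find_stem(word, stems=KNOWN_STEMS):
    """Return the longest known stem that `word` starts with, or None."""
    candidates = [stem for stem in stems if word.startswith(stem)]
    return max(candidates, key=len) if candidates else None

print(find_stem('上げます'))
print(find_stem('飲んだ'))
print(find_stem('学校'))
```

Taking the longest match keeps '上げます' from being assigned the shorter stem '上'. A prefix match alone can still mistake a noun that merely begins with a stem for a verb, so a fuller version would also need to check the ending.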

Now that we have it working we should include some way of finding the words that we plug in from Japanese or English so that we can start doing cross references. The following three functions will help with this.

In [32]:
# This method will check to see if the word is contained at all within the list of English words

def english_definition_search():
    meaning = input()
    definitions = merged_df[merged_df['meanings_wrapper'].str.contains(meaning, na=False)]
    return definitions

# This will check to see if the character you are showing is a word within the dataframe and list it, if it isn't 
# listed it will show you a print statement saying otherwise.

def jpn_kanji_search():
    kanji = input()
    if kanji in merged_df['word'].values:
        return merged_df.loc[(merged_df['word'] == kanji)]
    # Nothing matched: report it and return None instead of an unbound variable.
    print('Those kanji do not show up in that order')
    return None

# The method will take in a Kanji and will show you anything that has that character within it.

def related_kanji_search():
    kanji = input()
    related_words = merged_df[merged_df['word'].str.contains(kanji, na=False)]
    return related_words

Below are examples of how to run these searches. We will use the word 'mouth', whose character is '口'.

Kanji search looks for exactly what we have and tries to match it.

In [33]:
jpn_kanji_search()

口


Unnamed: 0,word,pronunciation,common tag,jlpt tag,meanings_wrapper,details_href,verb_tag,stem
68,口,くち,Common word,JLPT N5,"{'Noun': 'mouth', 'Suffix, Counter': 'opening;...",https://jisho.org/word/口,Not Verb,No Stem


Related Kanji search finds everything that shares that character and displays it.

In [34]:
related_kanji_search()

口


Unnamed: 0,word,pronunciation,common tag,jlpt tag,meanings_wrapper,details_href,verb_tag,stem
68,口,くち,Common word,JLPT N5,"{'Noun': 'mouth', 'Suffix, Counter': 'opening;...",https://jisho.org/word/口,Not Verb,No Stem
202,人口,じんこう,Common word,JLPT N4,"{'Noun': 'population', 'Wikipedia definition':...",https://jisho.org/word/人口,Not Verb,No Stem
655,蛇口,じゃぐち,Common word,JLPT N2,"{'Noun': 'faucet; tap', 'Place': 'Jaguchi', 'W...",https://jisho.org/word/蛇口,Not Verb,No Stem
1175,口紅,くちべに,Common word,JLPT N2,"{'Noun, No-adjective': 'lipstick', 'Wikipedia ...",https://jisho.org/word/口紅,Not Verb,No Stem
1676,口径,こうけい,Common word,Not required,"{'Noun': 'aperture; bore; calibre; caliber', '...",https://jisho.org/word/口径,Not Verb,No Stem
1932,口笛,くちぶえ,Common word,Not required,"{'Noun': 'whistle (sound made with the lips)',...",https://jisho.org/word/口笛,Not Verb,No Stem
2890,秋口,あきぐち,Common word,Not required,{'Noun': 'beginning of autumn; beginning of fa...,https://jisho.org/word/秋口,Not Verb,No Stem
2989,南口,みなみぐち,Common word,Not required,{'Noun': 'south entrance'},https://jisho.org/word/南口,Not Verb,No Stem
3076,西口,にしぐち,Common word,Not required,{'Noun': 'west entrance'},https://jisho.org/word/西口,Not Verb,No Stem
3096,切り口,きくち,Common word,Not required,"{'Noun': 'cut end; section; opening; slit', 'O...",https://jisho.org/word/切り口,Not Verb,No Stem


An English search shows every entry whose definition contains that word. The match is on substrings, so 'mouth' also returns ベルモット ('vermouth').

In [35]:
english_definition_search()

mouth


Unnamed: 0,word,pronunciation,common tag,jlpt tag,meanings_wrapper,details_href,verb_tag,stem
68,口,くち,Common word,JLPT N5,"{'Noun': 'mouth', 'Suffix, Counter': 'opening;...",https://jisho.org/word/口,Not Verb,No Stem
1906,ぽかん,Empty,Common word,Not required,"{""Adverb taking the 'to' particle"": 'vacantly;...",https://jisho.org/word/ぽかん,Not Verb,No Stem
3227,開口,かいこう,Common word,Not required,"{'Noun, Suru verb': 'opening; aperture (e.g. c...",https://jisho.org/word/開口,suru,suru verb
4232,大口,おおぐち,Common word,Not required,"{'Noun, No-adjective': 'big mouth', 'Other for...",https://jisho.org/word/大口,Not Verb,No Stem
5315,口コミ,Dual-script word,Common word,Not required,"{'Noun': 'word of mouth', 'Wikipedia definitio...",https://jisho.org/word/口コミ,Not Verb,No Stem
5885,ベルモット,Empty,Common word,Not required,"{'Noun': 'vermouth', 'Wikipedia definition': '...",https://jisho.org/word/ベルモット,Not Verb,No Stem
8920,お喋り,しゃべ,Common word,JLPT N3,"{'Noun, Suru verb': 'chattering; talk; idle ta...",https://jisho.org/word/お喋り,suru,suru verb
8938,含む,Dual-script word,Common word,JLPT N3,"{'Godan verb with mu ending, Transitive verb':...",https://jisho.org/word/含む,mu,含
9645,塞ぐ,Dual-script word,Common word,JLPT N2,"{'Godan verb with gu ending, Transitive verb':...",https://jisho.org/word/塞ぐ,gu,塞
11350,入口,いりぐち,Common word,JLPT N5,"{'Noun, No-adjective': 'entrance; entry; gate;...",https://jisho.org/word/入口,Not Verb,No Stem
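
The 'vermouth' row in the output above suggests the English search matches substrings. As a minimal sketch of the difference (the `definitions` dict and both function names are hypothetical stand-ins, not the notebook's `english_definition_search`), a whole-word search could be compared with a substring search like this:

```python
import re

# Hypothetical stand-in for the word -> definition data held in merged_df.
definitions = {
    "口": "mouth",
    "ベルモット": "vermouth",
    "口コミ": "word of mouth",
}

def substring_search(term):
    """Return every word whose definition contains term anywhere."""
    return [word for word, meaning in definitions.items() if term in meaning]

def whole_word_search(term):
    """Return every word whose definition contains term as a separate word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [word for word, meaning in definitions.items() if pattern.search(meaning)]

print(substring_search("mouth"))
print(whole_word_search("mouth"))
```

The whole-word version drops ベルモット because 'mouth' inside 'vermouth' is not preceded by a word boundary. Whether that stricter behavior is wanted depends on how the search is used.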


# 1.4
# Lemmatization
Surprisingly, we do not need to do this because the dictionary data already provides it. Every non-'suru' verb listed in the word column is already in its lemmatized (dictionary) form. 'Suru' verbs are nouns used as verbs, so the noun itself does not need to be lemmatized, and the word 'suru' is in the dictionary as its own word, already in lemmatized form.

So this is probably the easiest category to work with in the NLTK library so far.

# 1.5
# Making a Conjugation Checker
One major difference between English and Japanese that I want to emphasize is that Japanese has <b>MANY</b> conjugations for its verbs, so many that there is essentially a separate list for each verb type. Below is an informal list of some of the conjugations that come up, and it does not include any of the combinations that are also possible.

List of Verb Conjugations:

    1. Dictionary Form
        - Positive
        - Negative
    2. Polite Form
        - Positive
        - Negative
    3. Negative Form
        - Positive
        - Negative
    4. 'Te' Form
        - Positive
        - Negative
    5. Past Tense Form
        - Positive
        - Negative
    6. Potential Form
        - Positive
        - Negative
    7. Conditional Form (not found in cells below)
        - Positive
        - Negative
    8. Volitional Form (not found in cells below)
        - Positive
        - Negative
    9. Passive Form
        - Positive
        - Negative
    10. Causative Form
        - Positive
        - Negative
    11. Causative Passive Form
        - Positive
        - Negative
    12. Imperative Form
        - Positive
        - Negative
    
Since many Japanese verbs share the same conjugations, I will group them by conjugation type using a tag on each verb in the dataframe. Once these groups exist, it will be easy to redirect the verbs and stem them to a shorter and more readable form. We will make sure that we can recognize these words using the following...


### Step 2 - Recognizing Particles
Now that we have a very rough split between the scripts, we should find the Japanese particles. These are suffixes and short words that mark the grammatical role of the words around them. It is very important that we separate these from the initial splits that we made, because they will help us with tagging our parts of speech later.


## 2.
## Lemmatization and Stemming 
Before we begin with word recognition, there is one major difference between English and Japanese that I want to emphasize: Japanese has <b>MANY</b> conjugations for its verbs. There are so many that almost every verb has a whole list of forms, and for each form I will either need to create a separate value *or* treat it as its own stem so that no new value is needed.

<b>Disclaimer!</b> I will primarily focus on Lemmatization and not Stemming for this project.


In [36]:
# One table of endings per verb_tag; each ending is appended to the verb stem.
conjugation_dict = {
"ichidan_verb_conj": {"Non-past Aff": "る", "Non-past Neg": "ない", "Non-past, Pol Aff": "ます",
"Non-past, Pol Neg": "ません", "Past Aff": "た", "Past Neg": "なかった", "Past, Pol Aff": "ました",
"Past, Pol Neg": "ませんでした", "Te Form, Aff": "て", "Te form, Neg": "なくて", "Potential Aff": "られる",
"Potential Neg": "られない", "Passive Aff": "られる", "Passive Neg": "られない", "Causitive Aff": "させる",
"Causitive Neg": "させない", "Causitive Passive Aff": "させられる", "Causitive Passive Neg": "させられない",
"Imperative Aff": "ろ", "Imperative Neg": "るな"},

"godan_aru_conj": {"Non-past Aff": "る", "Non-past Neg": "ない", "Non-past, Pol Aff": "ります",
"Non-past, Pol Neg": "りません", "Past Aff": "った", "Past Neg": "なかった", "Past, Pol Aff": "りました",
"Past, Pol Neg": "りませんでした", "Te Form, Aff": "って", "Te form, Neg": "らなくて", "Potential Aff": "れる",
"Potential Neg": "れない", "Passive Aff": "られる", "Passive Neg": "られない", "Causitive Aff": "らせる",
"Causitive Neg": "らせない", "Causitive Passive Aff": "らせられる", "Causitive Passive Neg": "らせられない",
"Imperative Aff": "れ", "Imperative Neg": "るな"},

"godan_bu_conj": {"Non-past Aff": "ぶ", "Non-past Neg": "ばない", "Non-past, Pol Aff": "びます",
"Non-past, Pol Neg": "びません", "Past Aff": "んだ", "Past Neg": "ばなかった", "Past, Pol Aff": "びました",
"Past, Pol Neg": "びませんでした", "Te Form, Aff": "んで", "Te form, Neg": "ばなくて", "Potential Aff": "べる",
"Potential Neg": "べない", "Passive Aff": "ばれる", "Passive Neg": "ばれない", "Causitive Aff": "ばせる",
"Causitive Neg": "ばせない", "Causitive Passive Aff": "ばせられる", "Causitive Passive Neg": "ばせられない",
"Imperative Aff": "べ", "Imperative Neg": "ぶな"},

"godan_gu_conj": {"Non-past Aff": "ぐ", "Non-past Neg": "がない", "Non-past, Pol Aff": "ぎます",
"Non-past, Pol Neg": "ぎません", "Past Aff": "いだ", "Past Neg": "がなかった", "Past, Pol Aff": "ぎました",
"Past, Pol Neg": "ぎませんでした", "Te Form, Aff": "いで", "Te form, Neg": "がなくて", "Potential Aff": "げる",
"Potential Neg": "げない", "Passive Aff": "がれる", "Passive Neg": "がれない", "Causitive Aff": "がせる",
"Causitive Neg": "がせない", "Causitive Passive Aff": "がせられる", "Causitive Passive Neg": "がせられない",
"Imperative Aff": "げ", "Imperative Neg": "ぐな"},

"godan_ku_conj": {"Non-past Aff": "く", "Non-past Neg": "かない", "Non-past, Pol Aff": "きます",
"Non-past, Pol Neg": "きません", "Past Aff": "いた", "Past Neg": "かなかった", "Past, Pol Aff": "きました",
"Past, Pol Neg": "きませんでした", "Te Form, Aff": "いて", "Te form, Neg": "かなくて", "Potential Aff": "ける",
"Potential Neg": "けない", "Passive Aff": "かれる", "Passive Neg": "かれない", "Causitive Aff": "かせる",
"Causitive Neg": "かせない", "Causitive Passive Aff": "かせられる", "Causitive Passive Neg": "かせられない",
"Imperative Aff": "け", "Imperative Neg": "くな"},

"godan_iku_conj": {"Non-past Aff": "く", "Non-past Neg": "かない", "Non-past, Pol Aff": "きます",
"Non-past, Pol Neg": "きません", "Past Aff": "った", "Past Neg": "かなかった", "Past, Pol Aff": "きました",
"Past, Pol Neg": "きませんでした", "Te Form, Aff": "って", "Te form, Neg": "かなくて", "Potential Aff": "ける",
"Potential Neg": "けない", "Passive Aff": "かれる", "Passive Neg": "かれない", "Causitive Aff": "かせる",
"Causitive Neg": "かせない", "Causitive Passive Aff": "かせられる", "Causitive Passive Neg": "かせられない",
"Imperative Aff": "け", "Imperative Neg": "くな"},

"godan_mu_conj": {"Non-past Aff": "む", "Non-past Neg": "まない", "Non-past, Pol Aff": "みます",
"Non-past, Pol Neg": "みません", "Past Aff": "んだ", "Past Neg": "まなかった", "Past, Pol Aff": "みました",
"Past, Pol Neg": "みませんでした", "Te Form, Aff": "んで", "Te form, Neg": "まなくて", "Potential Aff": "める",
"Potential Neg": "めない", "Passive Aff": "まれる", "Passive Neg": "まれない", "Causitive Aff": "ませる",
"Causitive Neg": "ませない", "Causitive Passive Aff": "ませられる", "Causitive Passive Neg": "ませられない",
"Imperative Aff": "め", "Imperative Neg": "むな"},

"godan_nu_conj": {"Non-past Aff": "ぬ", "Non-past Neg": "なない", "Non-past, Pol Aff": "にます",
"Non-past, Pol Neg": "にません", "Past Aff": "んだ", "Past Neg": "ななかった", "Past, Pol Aff": "にました",
"Past, Pol Neg": "にませんでした", "Te Form, Aff": "んで", "Te form, Neg": "ななくて", "Potential Aff": "ねる",
"Potential Neg": "ねない", "Passive Aff": "なれる", "Passive Neg": "なれない", "Causitive Aff": "なせる",
"Causitive Neg": "なせない", "Causitive Passive Aff": "なせられる", "Causitive Passive Neg": "なせられない",
"Imperative Aff": "ね", "Imperative Neg": "ぬな"},

"godan_ru_conj": {"Non-past Aff": "る", "Non-past Neg": "らない", "Non-past, Pol Aff": "ります",
"Non-past, Pol Neg": "りません", "Past Aff": "った", "Past Neg": "らなかった", "Past, Pol Aff": "りました",
"Past, Pol Neg": "りませんでした", "Te Form, Aff": "って", "Te form, Neg": "らなくて", "Potential Aff": "れる",
"Potential Neg": "れない", "Passive Aff": "られる", "Passive Neg": "られない", "Causitive Aff": "らせる",
"Causitive Neg": "らせない", "Causitive Passive Aff": "らせられる", "Causitive Passive Neg": "らせられない",
"Imperative Aff": "れ", "Imperative Neg": "るな"},

"godan_su_conj": {"Non-past Aff": "す", "Non-past Neg": "さない", "Non-past, Pol Aff": "します",
"Non-past, Pol Neg": "しません", "Past Aff": "した", "Past Neg": "さなかった", "Past, Pol Aff": "しました",
"Past, Pol Neg": "しませんでした", "Te Form, Aff": "して", "Te form, Neg": "さなくて", "Potential Aff": "せる",
"Potential Neg": "せない", "Passive Aff": "される", "Passive Neg": "されない", "Causitive Aff": "させる",
"Causitive Neg": "させない", "Causitive Passive Aff": "させられる", "Causitive Passive Neg": "させられない",
"Imperative Aff": "せ", "Imperative Neg": "すな"},

"godan_tsu_conj": {"Non-past Aff": "つ", "Non-past Neg": "たない", "Non-past, Pol Aff": "ちます",
"Non-past, Pol Neg": "ちません", "Past Aff": "った", "Past Neg": "たなかった", "Past, Pol Aff": "ちました",
"Past, Pol Neg": "ちませんでした", "Te Form, Aff": "って", "Te form, Neg": "たなくて", "Potential Aff": "てる",
"Potential Neg": "てない", "Passive Aff": "たれる", "Passive Neg": "たれない", "Causitive Aff": "たせる",
"Causitive Neg": "たせない", "Causitive Passive Aff": "たせられる", "Causitive Passive Neg": "たせられない",
"Imperative Aff": "て", "Imperative Neg": "つな"},

"godan_u_conj": {"Non-past Aff": "う", "Non-past Neg": "わない", "Non-past, Pol Aff": "います",
"Non-past, Pol Neg": "いません", "Past Aff": "った", "Past Neg": "わなかった", "Past, Pol Aff": "いました",
"Past, Pol Neg": "いませんでした", "Te Form, Aff": "って", "Te form, Neg": "わなくて", "Potential Aff": "える",
"Potential Neg": "えない", "Passive Aff": "われる", "Passive Neg": "われない", "Causitive Aff": "わせる",
"Causitive Neg": "わせない", "Causitive Passive Aff": "わせられる", "Causitive Passive Neg": "わせられない",
"Imperative Aff": "え", "Imperative Neg": "うな"},

"kuru_conj": {"Non-past Aff": "る", "Non-past Neg": "ない", "Non-past, Pol Aff": "ます",
"Non-past, Pol Neg": "ません", "Past Aff": "た", "Past Neg": "なかった", "Past, Pol Aff": "ました",
"Past, Pol Neg": "ませんでした", "Te Form, Aff": "て", "Te form, Neg": "なくて", "Potential Aff": "られる",
"Potential Neg": "られない", "Passive Aff": "られる", "Passive Neg": "られない", "Causitive Aff": "させる",
"Causitive Neg": "させない", "Causitive Passive Aff": "させられる", "Causitive Passive Neg": "させられない",
"Imperative Aff": "い", "Imperative Neg": "るな"},

"suru_conj": {"Non-past Aff": "する", "Non-past Neg": "しない", "Non-past, Pol Aff": "します",
"Non-past, Pol Neg": "しません", "Past Aff": "した", "Past Neg": "しなかった", "Past, Pol Aff": "しました",
"Past, Pol Neg": "しませんでした", "Te Form, Aff": "して", "Te form, Neg": "しなくて", "Potential Aff": "できる",
"Potential Neg": "できない", "Passive Aff": "される", "Passive Neg": "されない", "Causitive Aff": "させる",
"Causitive Neg": "させない", "Causitive Passive Aff": "させられる", "Causitive Passive Neg": "させられない",
"Imperative Aff": "しろ", "Imperative Neg": "するな"}
}
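
To show how one of these tables might be used, here is a minimal sketch that attaches every ending to a stem. The `conjugate` helper is hypothetical and not part of the project, and the table is a four-entry excerpt of `godan_mu_conj` so that the cell runs on its own.

```python
# Excerpt of the godan "mu" table from the cell above (four of its entries).
godan_mu_excerpt = {
    "Non-past Aff": "む",
    "Non-past Neg": "まない",
    "Past Aff": "んだ",
    "Te Form, Aff": "んで",
}

def conjugate(stem, endings):
    """Hypothetical helper: attach every ending in a table to a verb stem."""
    return {label: stem + ending for label, ending in endings.items()}

# 含 is the stem stored for 含む in the dataframe's 'stem' column.
forms = conjugate("含", godan_mu_excerpt)
for label, form in forms.items():
    print(label, form)
```

The same helper would work with any of the full tables, provided the stem follows the same convention as the dataframe's `stem` column.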

### Calling the proper conjugation
Each of these dictionaries is named after the verb_tag it belongs to, so now we must create a mapping for these combinations.

In [37]:
merged_df.verb_tag.unique()

array(['Not Verb', 'suru', 'ichidan', 'ru', 'mu', 'aru', 'u', 'tsu', 'su',
       'ku_s', 'gu', 'bu', 'ku', 'nu', 'kuru'], dtype=object)

In [38]:
verb_conj_list = ['ichidan_verb_conj', 'godan_aru_conj', 'godan_bu_conj', 'godan_gu_conj', 'godan_ku_conj', 'godan_iku_conj',
'godan_mu_conj', 'godan_nu_conj', 'godan_ru_conj', 'godan_su_conj', 'godan_tsu_conj', 'godan_u_conj',
'kuru_conj', 'suru_conj']
verb_tag_list = ['ichidan', 'aru', 'bu', 'gu', 'ku', 'iku', 'mu', 'nu', 'ru', 'su', 'tsu', 'u', 'kuru', 'suru']

# Used to check that both are the same size before we zip and set them
len(verb_conj_list), len(verb_tag_list)

(14, 14)

In [39]:
dictionary_group = zip(verb_conj_list, verb_tag_list)
dictionary_group = set(dictionary_group)

In [40]:
print(dictionary_group)

{('godan_iku_conj', 'iku'), ('godan_mu_conj', 'mu'), ('godan_gu_conj', 'gu'), ('godan_u_conj', 'u'), ('godan_ru_conj', 'ru'), ('ichidan_verb_conj', 'ichidan'), ('godan_ku_conj', 'ku'), ('godan_tsu_conj', 'tsu'), ('godan_aru_conj', 'aru'), ('godan_bu_conj', 'bu'), ('godan_nu_conj', 'nu'), ('kuru_conj', 'kuru'), ('godan_su_conj', 'su'), ('suru_conj', 'suru')}
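
A set of pairs is fine for inspection, but it cannot be indexed by tag. As one possible alternative (an assumption on my part, not what the cells above do), the same two lists can be zipped into a dict keyed by verb_tag. The name `tag_to_table` is hypothetical.

```python
verb_conj_list = ['ichidan_verb_conj', 'godan_aru_conj', 'godan_bu_conj', 'godan_gu_conj',
                  'godan_ku_conj', 'godan_iku_conj', 'godan_mu_conj', 'godan_nu_conj',
                  'godan_ru_conj', 'godan_su_conj', 'godan_tsu_conj', 'godan_u_conj',
                  'kuru_conj', 'suru_conj']
verb_tag_list = ['ichidan', 'aru', 'bu', 'gu', 'ku', 'iku', 'mu', 'nu', 'ru', 'su',
                 'tsu', 'u', 'kuru', 'suru']

# Key by verb_tag so a tag read from the dataframe can be looked up directly.
tag_to_table = dict(zip(verb_tag_list, verb_conj_list))

print(tag_to_table['mu'])
print(tag_to_table.get('Not Verb'))  # tags without a table come back as None
```

With this mapping, `conjugation_dict[tag_to_table[tag]]` would return the endings for a given verb_tag.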


Now that we have a set of pairs that ties each conjugation dictionary to the correct verb_tag, we can create a method that builds the stem + the conjugation ending for each of them whenever one of the verbs is checked within a sentence.

However, such a method would be very complex, and the time limit for the project does not leave enough time to complete it.

So the next update to this Japanese data conditioning project will be creating the verb conjugation checker.

#### For reference
We did not include ku_s in this list since those verbs have a special conjugation that would make them even more difficult to work with, so we will expand to that one later on.

### How would you create the function? 
I would have the function be the step right before conversion to English. It would first convert all nouns, adjectives, adverbs and so on, or otherwise remove them from the function's view. With what is left, it would check for a verb stem and see whether it matches any of the combinations we have in our dictionary, then check the value before it to see whether the two combined make a word.
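
A rough sketch of that idea, under stated assumptions: the token has already been joined into one string, `verbs` is a toy stand-in for rows of `merged_df`, and `endings_by_tag` is a small excerpt shaped like `conjugation_dict`. The function name `check_conjugation` is hypothetical, and this is not the finished checker.

```python
# Toy stand-ins for dataframe rows and for the conjugation tables (assumed shapes).
verbs = [
    {"word": "含む", "stem": "含", "verb_tag": "mu"},
    {"word": "塞ぐ", "stem": "塞", "verb_tag": "gu"},
]
endings_by_tag = {
    "mu": {"Non-past Aff": "む", "Past Aff": "んだ", "Non-past Neg": "まない"},
    "gu": {"Non-past Aff": "ぐ", "Past Aff": "いだ", "Non-past Neg": "がない"},
}

def check_conjugation(token):
    """Return (dictionary form, label) pairs where stem + ending equals token."""
    matches = []
    for verb in verbs:
        if not token.startswith(verb["stem"]):
            continue
        # Whatever follows the stem must be exactly one known ending.
        remainder = token[len(verb["stem"]):]
        for label, ending in endings_by_tag[verb["verb_tag"]].items():
            if remainder == ending:
                matches.append((verb["word"], label))
    return matches

print(check_conjugation("含んだ"))
print(check_conjugation("塞がない"))
print(check_conjugation("口"))
```

Several labels can share one ending (for example the ichidan potential and passive), so returning a list of matches rather than a single answer keeps that ambiguity visible.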

# 1.6
# Named Entity Recognition with Trigrams
I want to preface this with a very firm statement that in almost <b>every</b> case we will have split the sentence too much. We will fix this by making trigrams from the list that we create from our splits, with the explicit task of recognizing common prefixes and suffixes within Japanese.

Japanese Prefixes are broken up into two categories: 

    - Honorific prefixes
        - E.g. お when put in front of words is saying 'The honorable __'
    - Characteristic prefixes
        - E.g. 再 (さい) when put in front of a word means repeating/again " ___ again "

Japanese Suffixes are broken up into four categories (for the sake of this project):

    - Adjective suffixes
        1. い adjectives are one of the most common ones and all end in い so they are easy to spot.
        2. な adjectives are also easy to spot, with a few minor exceptions, because they all end in な.
        3. の adjectives are the rarest of the three, and because の is also a particle we will have a harder time
               recognizing these.
        
    - Adverb suffixes
        Not every adverb will be an issue but the ones we must watch out for are the following:
            1. い adjectives can become adverbs that end in く.
            2. な adjectives can become adverbs that end in に.
    
    - Verb suffixes 
        1. Ichidan verbs are arguably the easiest to deal with in that they have one set way to conjugate.
        2. Godan verbs are the hardest in that there are five main bases and they conjugate differently.
        3. Irregular verbs are the smallest group, and they should be the first ones the code looks for so they don't 
                get mixed up with Ichidan or Godan verbs.
           
    - Name/Honorific suffixes
        These are limited in number but would help immensely with figuring out N.E.R. later on, since they are only 
                used with names.
              
Some of these groups will be important outside of NER, but with more precise knowledge of how conjugation works with suffixes and prefixes, the work needed to start figuring out pronouns and locations will become much more reasonable.

As of right now there is no work done on the NER content, because of the amount of work it would take when the verb conjugation alone already took up a significant amount of time.
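
Although that work is not done yet, the trigram idea described above could be sketched roughly as follows. Everything here is an assumption for illustration: the two affix sets are short and far from complete, the function names are hypothetical, and the tokens are assumed to come from an over-eager split.

```python
HONORIFIC_PREFIXES = {"お", "ご"}              # assumed, not exhaustive
NAME_SUFFIXES = {"さん", "様", "君", "ちゃん"}  # assumed, not exhaustive

def trigrams(tokens):
    """Return (previous, current, next) windows, padded with None at both ends."""
    padded = [None] + list(tokens) + [None]
    return list(zip(padded, padded[1:], padded[2:]))

def rejoin_affixes(tokens):
    """Glue honorific prefixes onto the next token and name suffixes onto the previous one."""
    merged = []
    glue_to_previous = False  # True when the last token was a prefix awaiting its word
    for prev, cur, nxt in trigrams(tokens):
        is_suffix = cur in NAME_SUFFIXES and prev is not None
        if (glue_to_previous or is_suffix) and merged:
            merged[-1] += cur
        else:
            merged.append(cur)
        # A prefix is glued to whatever follows it, if anything does.
        glue_to_previous = cur in HONORIFIC_PREFIXES and nxt is not None
    return merged

tokens = ["お", "茶", "を", "田中", "さん", "に"]
print(rejoin_affixes(tokens))
```

A token such as お is not always an honorific prefix, so a real version would need to confirm the merged word against the dictionary before accepting it.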

# 1.7
# Chunking and Stop Words
Chunking is not something that I will need to focus on for some time but it will be fairly similar to normal chunking in NLTK. 

As for stop word removal, I will wait on that until I manage to fix the conjugation checker, because some characters may fall through the cracks and be mistaken for particles until we manage to either stem or lemmatize the words.

# 1.8 
# Parts of Speech Tagging
For the most part, this project has already laid the groundwork for parts of speech tagging, since we labeled the data as we created the dataframes. However, if we were to expand the web scraper, as mentioned previously in the web scraping Python program, we could very easily make a proper parts of speech tagger that is robust enough to compare with normal NLTK. It may not pick up on every situation a word is used in, but Japanese is very particular about how a word becomes a different part of speech, so this would be viable as a final step near the end of the project.




# 2.0
# Potential Expansion of the Project Post completion
There are always options to continue the project beyond what I already have plans for, but one that has caught my interest is to work with word embedding and character embedding once the project is more fully finished, to see how well the supplemental library holds up.

### What is Word Embedding / Character Embedding?
There are two ways we can go about this. For word embedding, we won't need to do anything with the English translated portions, since the translated words would work perfectly with word embedding.

But it would be interesting to see how those compare to, say, character embedding over the Japanese scripts, and that is a potential end-of-project expansion.

# 3.0 
# Final Thoughts
I guess one thing that I knew going into a project like this was that the more I tried to reach, the harder it would get to complete anything. The project as a whole was more about learning where to draw the line in the sand and say that something was good enough, or close enough to completion, as it is right now. The project can be continuously worked on in so many ways, and that is not even counting adding an entirely new set of vocabulary to the mix.

What the project has completed:

    - Pre-language processing
    - Basic sentence processing
    - Basic Stemming / Lemmatization
    - Particle recognition
    - Word/Character lookup and reverse lookup
    - Basic Parts of Speech Tagging
    - Basic Tokenization
    
While this covers a lot and has the data in a more manageable spot than where I had it, there are still so many things that could be done to improve the situation and make the project even more robust.

Going back to what was mentioned in the abstract at the top, I have many stretch goals that just were not possible within the time constraint of the initial due date:

    - Including more particles
    - Including complex Stemming / Lemmatization
    - Complex sentence processing
    - Stop Word inclusion
    - NER/NED inclusion
    - Better Parts of Speech Tagging
    

What anyone can do with the current build of the project:
  
    - Split a Japanese sentence into words or rough equivalents
    - Make that sentence into a readable list
    - Separate particles from that list
    - Take individual parts of that list and put them into the search functions
    - Obtain the English from those functions and reverse search
    
Even with just those steps, that is already far better than where Japanese and NLTK started, which was not even being able to recognize separate words.

Thank you for your time in going over and reading the project.

### Suggestions?
If you have any suggestions or critiques of the project, please direct them to my GitHub or to me directly, and we can have some insightful discussion about the language, NLTK, or anything in between.

This is a love project of my own creation and anything that can help is something worth considering.