![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/webinars_conferences_etc/multi_lingual_webinar/4_Unsupervise_Chinese_Keyword_Extraction_NER_and_Translation_from_Chinese_News.ipynb)

![Flags](http://ckl-it.de/wp-content/uploads/2021/02/flags.jpeg)

In [1]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

Installing  NLU 3.4.3rc2 with  PySpark 3.0.3 and Spark NLP 3.4.2 for Google Colab ...

# Analyzing Chinese News Articles With NLU
## This notebook shows how to extract Chinese keywords unsupervised with YAKE, detect named entities, and translate both to English
### In addition, we will use the Chinese WordSegmenter and Lemmatizer to preprocess our data further and get a better view of our data distribution


# [Chinese official daily news](https://www.kaggle.com/noxmoon/chinese-official-daily-news-since-2016)
![Chinese News](https://upload.wikimedia.org/wikipedia/zh/6/69/XINWEN_LIANBO.png)
### Xinwen Lianbo is a daily news programme produced by China Central Television. It is shown simultaneously by all local TV stations in mainland China, making it one of the world's most-watched programmes. It has been broadcast since 1 January 1978.
Source: Wikipedia



In [3]:
import pandas as pd
# Requires chinese_news.csv from the Kaggle dataset linked above, saved next to this notebook
df = pd.read_csv('./chinese_news.csv')
df

FileNotFoundError: ignored
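
The error above appears when `chinese_news.csv` is not in the working directory. A minimal sketch of a loader that fails early with a clearer message could look like this. The helper name `load_news` is hypothetical and not part of NLU or the original notebook; the `headline` column is the one the later cells read.

```python
from pathlib import Path

import pandas as pd


def load_news(csv_path="./chinese_news.csv"):
    """Load the news CSV, failing early with a helpful message if it is missing.

    Hypothetical helper for illustration only.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Download the dataset from the Kaggle page "
            "linked above and place the CSV at this path."
        )
    news = pd.read_csv(path)
    # The rest of the notebook reads the `headline` column, so check for it here.
    if "headline" not in news.columns:
        raise KeyError("expected a 'headline' column in the CSV")
    return news
```

With the file in place, `df = load_news()` would stand in for the `pd.read_csv` call above.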

# Depending on how we pre-process our text, YAKE extracts different keywords. In this tutorial we will look at the effect of **Lemmatization** and **Word Segmentation** and see how the distribution of keywords changes
- Lemmatization
- Word Segmentation

# Apply YAKE - Keyword Extractor to the raw text
First we do no pre-processing at all and just calculate keywords from the raw headlines with YAKE.

In [4]:
yake_df = nlu.load('yake').predict(df.headline)
yake_df

NameError: ignored

## The predicted Chinese keywords don't show up in the Pandas plot labels, and you probably do not speak Chinese!
### This is why we will translate each extracted keyword into English and then take a look at the distribution again

In [None]:
yake_df.explode('keywords_classes').keywords_classes.value_counts()[0:100].plot.bar(title='Top 100 in Chinese News Articles. No Chinese Keywords :( So lets translate!', figsize=(20,8))

### We get the top 100 keywords and store the counts together with the keywords in a new DF

In [None]:
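# Count how often each keyword occurs and keep the 100 most frequent
top_100_zh = yake_df.explode('keywords_classes').keywords_classes.value_counts()[0:100]
# Create new DF from the counts
top_100_zh = pd.DataFrame(top_100_zh)
# Keep the Chinese keyword as its own column before resetting the index
top_100_zh['zh'] = top_100_zh.index
top_100_zh.reset_index(inplace=True)
top_100_zh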
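
To make the counting step easier to follow, here is a small sketch on made-up data. `toy_yake_df` is a hypothetical stand-in for the YAKE output, assuming `keywords_classes` holds one list of keywords per headline. Newer pandas releases name the `value_counts()` result `count` rather than after the original column, so the sketch builds the frame from explicit columns instead of relying on either name.

```python
import pandas as pd

# Toy stand-in for the YAKE output: one list of keywords per headline.
toy_yake_df = pd.DataFrame({
    "keywords_classes": [["经济", "发展"], ["经济"], ["合作", "发展", "经济"]],
})

# explode() gives one row per keyword; value_counts() sorts by frequency.
counts = toy_yake_df["keywords_classes"].explode().value_counts()

# Build the frame from explicit columns so the column names do not depend
# on how the installed pandas version names the value_counts() result.
top_keywords = pd.DataFrame(
    {"zh": counts.index, "count": counts.to_numpy()}
).head(100)
print(top_keywords)
```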


### Now we can just translate each predicted keyword with `zh.translate_to.en` in 1 line of code and see what is actually going on in the dataset

In [None]:
top_100_en = nlu.load('zh.translate_to.en').predict(top_100_zh.zh)
top_100_en

#### Write the translations into the df with the Keyword counts so we can plot them together in the next step

In [None]:
# Write translation back to the keyword df with the counts
top_100_zh['en'] = top_100_en.translation
top_100_zh
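
Writing the translations back like this relies on pandas aligning the assigned Series with the DataFrame on index labels, not on row position. Whether the translated frame keeps the same index as its input is an assumption about the prediction output, so the sketch below uses made-up data to show what happens when the labels do not match and how to assign by position instead.

```python
import pandas as pd

counts_df = pd.DataFrame({"zh": ["经济", "发展"], "count": [3, 2]})  # index 0, 1

# Hypothetical translation result whose index does not match counts_df.
translated = pd.Series(
    ["economy", "development"], index=[10, 11], name="translation"
)

# Series assignment aligns on index labels: no labels match, so every value is NaN.
counts_df["en_aligned"] = translated

# If the row order is known to be identical, assigning the raw values
# bypasses index alignment.
counts_df["en_by_position"] = translated.to_numpy()
print(counts_df)
```

If the new column comes back full of NaN, a mismatched index is the first thing to check.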

## Now we can plot every keyword as a bar chart labelled with its translation and understand which keywords appeared in Chinese news!

In [None]:
top_100_zh.index = top_100_zh.en
top_100_zh.keywords_classes.plot.barh(figsize=(20,20), title='Distribution of the top 100 translated keywords from Chinese news articles, generated by the YAKE algorithm applied to raw data')

# Apply YAKE to Segmented/Tokenized data
We gave the YAKE algorithm full headlines which were not segmented. To better understand the Chinese text, we can segment it into tokens and analyze their occurrence instead
## YAKE + Word Segmentation

In [None]:
# Segment words into tokens with the word segmenter
# This will output 1 row per token
seg_df = nlu.load('zh.segment_words').predict(df.headline)
seg_df 

### Join the tokens back into whitespace-separated strings for the YAKE keyword extraction in the next step

In [None]:
# Join the tokens back into whitespace-separated strings, one per headline
joined_segs = seg_df.token.groupby(seg_df.index).transform(lambda x : ' '.join(x)).drop_duplicates()
joined_segs
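
The join step can be tried on made-up data. `toy_seg_df` is a hypothetical stand-in for the segmenter output, assuming one row per token with the index repeating for tokens of the same headline (which is what grouping by `seg_df.index` implies). The sketch also shows a side effect of `transform` followed by `drop_duplicates`: it deduplicates by value, so headlines with identical text collapse into one row.

```python
import pandas as pd

# Toy stand-in for the segmenter output: one row per token, with the index
# repeating for all tokens that came from the same headline.
toy_seg_df = pd.DataFrame(
    {"token": ["中国", "经济", "发展", "中国", "经济", "发展", "国际", "合作"]},
    index=[0, 0, 0, 1, 1, 1, 2, 2],
)

# One joined string per headline, keyed by the original index.
joined = toy_seg_df["token"].groupby(level=0).agg(" ".join)

# The transform + drop_duplicates variant deduplicates by value, so the two
# identical headlines (index 0 and 1) collapse into a single row.
deduped = (
    toy_seg_df["token"]
    .groupby(level=0)
    .transform(lambda tokens: " ".join(tokens))
    .drop_duplicates()
)
print(joined)
print(deduped)
```

Which variant fits depends on whether repeated headlines should count once or every time they occur.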

### Now we can extract keywords with YAKE on the whitespace-separated tokens


In [None]:
seg_yake_df = nlu.load('yake').predict(joined_segs)
seg_yake_df

In [None]:
# Get the top 100 occurring keywords from the joined segmented tokens
top_100_seg_zh = seg_yake_df.explode('keywords_classes').keywords_classes.value_counts()[0:100]#.plot.bar(title='Top 100 in Chinese News Articles Segmented', figsize=(20,8))
top_100_seg_zh = pd.DataFrame(top_100_seg_zh )
top_100_seg_zh

## Get the top 100 keywords and translate them like we did for the raw data, as data preparation for visualizing the keyword distribution

In [None]:
# Create new DF from the counts
top_100_seg_zh['zh'] = top_100_seg_zh.index
top_100_seg_zh.reset_index(inplace=True)
# Write translations back to the df with the keyword counts
top_100_seg_zh['en'] = nlu.load('zh.translate_to.en').predict(top_100_seg_zh.zh).translation

### Visualize the distribution of the keywords extracted from the segmented tokens
We can observe that this distribution is very different from the one we got on the raw headlines

In [None]:
top_100_seg_zh.index = top_100_seg_zh.en
top_100_seg_zh.keywords_classes.plot.barh(figsize=(20,20), title = 'Segmented Keywords YAKE Distribution')
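
One simple way to put a number on how different two keyword distributions are is the Jaccard overlap of their top keyword sets. This is not part of the original notebook; it is a minimal sketch with made-up keyword lists, and `jaccard_overlap` is a hypothetical helper name.

```python
def jaccard_overlap(keywords_a, keywords_b):
    """Share of distinct keywords that two keyword lists have in common."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up examples, not taken from the dataset.
raw_top = ["经济发展", "国际合作", "新闻"]
segmented_top = ["经济", "发展", "新闻"]

overlap = jaccard_overlap(raw_top, segmented_top)
print(f"Jaccard overlap: {overlap:.2f}")
```

In the notebook, the two lists would be the `zh` columns of `top_100_zh` and `top_100_seg_zh`; a value close to 0 means the two top-100 lists share almost no keywords.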

# Apply YAKE to Segmented and Lemmatized data

In [None]:
# Word segmentation is included automatically when loading the lemmatizer
zh_lem_df = nlu.load('zh.lemma').predict(df.headline)
zh_lem_df

## Join tokens into a whitespace-separated string like we did previously for Word Segmentation

In [None]:
zh_lem_df['lem_str'] = zh_lem_df.lemma.str.join(' ')
zh_lem_df
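
Here the join is simpler than in the segmentation step because `.str.join` works directly on a column whose cells are lists of strings. A minimal sketch on made-up data, where `toy_lem_df` is a hypothetical stand-in that assumes the `lemma` column holds one list of lemmas per headline:

```python
import pandas as pd

# Toy stand-in for the lemmatizer output: one list of lemmas per headline.
toy_lem_df = pd.DataFrame(
    {"lemma": [["中国", "经济"], ["国际", "合作", "发展"]]}
)

# .str.join concatenates the elements of each list with the given separator.
toy_lem_df["lem_str"] = toy_lem_df["lemma"].str.join(" ")
print(toy_lem_df)
```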

## Extract Keywords on Lemmatized + Word Segmented Chinese text

In [None]:
yake_lem_df = nlu.load('yake').predict(zh_lem_df.lem_str)
yake_lem_df

In [None]:
top_100_stem = yake_lem_df.explode('keywords_classes').keywords_classes.value_counts()[:100]
top_100_stem = pd.DataFrame(top_100_stem)
# Create new DF from the counts
top_100_stem['zh'] = top_100_stem.index
top_100_stem.reset_index(inplace=True)
# Write translations back to the df with the keyword counts
top_100_stem['en'] = nlu.load('zh.translate_to.en').predict(top_100_stem.zh).translation
top_100_stem

# Plot the Segmented and Lemmatized Distribution of extracted keywords 

In [None]:
top_100_stem.index = top_100_stem.en
top_100_stem.keywords_classes.plot.barh(figsize=(20,20), title='Distribution of the top 100 translated keywords from Chinese news articles, generated by the YAKE algorithm applied to lemmatized and segmented Chinese text')

# Extract Chinese Named entities

In [None]:
# Run NER on the first 1000 headlines only; output_level='document' keeps one row per headline
zh_ner_df = nlu.load('zh.ner').predict(df.iloc[:1000].headline, output_level='document')
zh_ner_df

In [None]:
# Translate Detected Chinese Entities to English
en_entities = nlu.load('zh.translate_to.en').predict(zh_ner_df.explode('entities').entities)
en_entities
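
One thing to watch when exploding the entities: pandas turns an empty list into NaN, so headlines without any detected entity can end up as NaN rows in the translator input. Whether the NER output contains empty lists for such headlines is an assumption here; the sketch below uses made-up data (`toy_ner_df` is hypothetical) to show the effect and a `dropna()` guard.

```python
import pandas as pd

# Toy stand-in for the NER output: one list of entities per headline,
# including one headline with no detected entities.
toy_ner_df = pd.DataFrame({"entities": [["中国", "北京"], [], ["北京"]]})

entities = toy_ner_df["entities"].explode()
# explode() turns an empty list into NaN, so drop those rows before passing
# the values on to a translator.
entities = entities.dropna()

entity_counts = entities.value_counts()
print(entity_counts)
```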

In [None]:
en_entities.translation.value_counts()[0:100].plot.barh(figsize=(20,20), title = "Top 100 translated named entities")

# There are many more models!
## Check out [the Models Hub](https://nlp.johnsnowlabs.com/models) and the [NLU Namespace](https://nlu.johnsnowlabs.com/docs/en/spellbook) for more models