# Text mining

## Text Assembly

It is commonly estimated that around 70% of the data available to a business is unstructured. The first step is to collate that unstructured data from different sources, such as open-ended feedback, phone calls, email support, online chat, and social networks like Twitter, LinkedIn, and Facebook. Assembling this data and applying text mining and machine learning techniques to it gives organizations an opportunity to improve the customer experience.
Several libraries can extract text content from the formats mentioned above. By far the most convenient one, offering a single simple interface for multiple formats, is `textract` (open source, MIT license). Note that, at the time of writing, this package is available for Linux and macOS but not for Windows. The textract documentation lists the supported formats.

### Example: Twitter

#### API access token

- Go to https://apps.twitter.com/
- Click on 'Create New App'
- Fill in the required information and click on 'Create your Twitter Application'
- You'll find the access details under the 'Keys and Access Tokens' tab
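Once you have the four values, it is usually safer to keep them out of the notebook itself. The sketch below reads them from environment variables using only the standard library; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical, chosen just for this illustration.

```python
import os

# Hypothetical variable names; use whatever naming your environment follows.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}
```

Calling `load_twitter_credentials()` with no argument reads from `os.environ`; passing a plain dict makes the helper easy to try out without touching the real environment.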

In [59]:
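import pandas as pd
import numpy as np
import tweepy
# Streaming helpers; not used in the cells below.
# tweepy 4.0 merged StreamListener into Stream, so this import only works on older versions.
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream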

In [44]:
access_token = "8397390582---------------------------------"
access_token_secret = "dr5L3QHHkIls6Rbffz-------------------"
consumer_key = "U1eVHGzL-----------------"
consumer_secret = "qATe7kb41zRAz------------------------------------"

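# Authenticate with the consumer key pair, then attach the access token pair
auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# API client used for all requests below
api = tweepy.API(auth)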

In [119]:
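# Pass the query as a single string: space-separated terms must all match,
# use 'Bitcoin OR ethereum' to match either term.
# tweepy 4.0 renamed this method to api.search_tweets.
fetched_tweets = api.search(q='Bitcoin ethereum', result_type='recent', lang='en', count=10)
print("Number of tweets: ", len(fetched_tweets))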

Number of tweets:  10


In [125]:
for tweet in fetched_tweets:
    print('Tweet AUTHOR: ', tweet.author.name)
    print('Tweet ID: ', tweet.id)
    print('Tweet Text: ', tweet.text, '\n')

Tweet AUTHOR:  Jeanie Fantoni
Tweet ID:  939420455605174273
Tweet Text:  RT @6BillionPeople: Welcome to my Twitter. Im "MarQuis Trill @6BillionPeople " Everyone asks me "How can I buy #Bitcoin #Litecoin #Ethereum… 

Tweet AUTHOR:  Eliana Holmes
Tweet ID:  939420449699717120
Tweet Text:  RT @6BillionPeople: Welcome to my Twitter. Im "MarQuis Trill @6BillionPeople " Everyone asks me "How can I buy #Bitcoin #Litecoin #Ethereum… 

Tweet AUTHOR:  Christe Louise
Tweet ID:  939420448206422016
Tweet Text:  RT @rateico: ICO MARKET 2.0 👍 Self-Regulation of the #ICO #Market 🌐Ecosystem
Community Intelligence 🔁 https://t.co/1HrpeLgQcK Join Our ICO… 

Tweet AUTHOR:  Nurfitriyana
Tweet ID:  939420445996142592
Tweet Text:  RT @Bazista_io: #Bazista platform review - concept, vision and key advantages. 
Read more:

https://t.co/EDLLN8hPfn

#ICO #tokensale #ether… 

Tweet AUTHOR:  Chieko Jean-pierre
Tweet ID:  939420441189392385
Tweet Text:  RT @ico_report: The Largest Channel about ICO in Telegram https://t.

There are many other ways to collect data, for example from PDF files and voice recordings.

### Preprocessing


#### NLTK (Natural Language Toolkit)

NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.


**Bad news: NLTK does not support Persian.**

**Good news: hazm is a similar library for Persian language processing.**


In [103]:
import nltk 
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

**Stop words**

In [60]:
from nltk.corpus import stopwords

stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
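To make the purpose of such a list concrete, here is a minimal sketch of stop-word removal. It uses only the ten English stop words printed above rather than the full NLTK list, and the helper `remove_stop_words` and the sample sentence are made up for illustration.

```python
# The ten English stop words shown in the output above; the full NLTK list is longer.
STOP_WORDS = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your'}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in sentence.lower().split() if tok not in stop_words]


filtered = remove_stop_words("We lost our keys and you found my wallet")
print(filtered)  # ['lost', 'keys', 'and', 'found', 'wallet']
```

Note that 'and' survives only because it is not among these ten words; with the complete list it would be dropped as well.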

In [61]:
# With no language argument, words() returns the stop words of every language in the corpus
len(stopwords.words())

3136

In [62]:
import hazm 
hazm.stopwords_list()

['و',
 'در',
 'به',
 'از',
 'که',
 'این',
 'را',
 'با',
 'است',
 'برای',
 'آن',
 'یک',
 'خود',
 'تا',
 'کرد',
 'بر',
 'هم',
 'نیز',
 'گفت',
 'می\u200cشود',
 'وی',
 'شد',
 'دارد',
 'ما',
 'اما',
 'یا',
 'شده',
 'باید',
 'هر',
 'آنها',
 'بود',
 'او',
 'دیگر',
 'دو',
 'مورد',
 'می\u200cکند',
 'شود',
 'کند',
 'وجود',
 'بین',
 'پیش',
 'شده_است',
 'پس',
 'نظر',
 'اگر',
 'همه',
 'یکی',
 'حال',
 'هستند',
 'من',
 'کنند',
 'نیست',
 'باشد',
 'چه',
 'بی',
 'می',
 'بخش',
 'می\u200cکنند',
 'همین',
 'افزود',
 'هایی',
 'دارند',
 'راه',
 'همچنین',
 'روی',
 'داد',
 'بیشتر',
 'بسیار',
 'سه',
 'داشت',
 'چند',
 'سوی',
 'تنها',
 'هیچ',
 'میان',
 'اینکه',
 'شدن',
 'بعد',
 'جدید',
 'ولی',
 'حتی',
 'کردن',
 'برخی',
 'کردند',
 'می\u200cدهد',
 'اول',
 'نه',
 'کرده_است',
 'نسبت',
 'بیش',
 'شما',
 'چنین',
 'طور',
 'افراد',
 'تمام',
 'درباره',
 'بار',
 'بسیاری',
 'می\u200cتواند',
 'کرده',
 'چون',
 'ندارد',
 'دوم',
 'بزرگ',
 'طی',
 'حدود',
 'همان',
 'بدون',
 'البته',
 'آنان',
 'می\u200cگوید',
 'دیگری',
 'خواهد_شد',


In [63]:
len(hazm.stopwords_list())

389

### Feature Extraction 

In [104]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

In [105]:
text = pd.read_csv('./Datasets/text_mining.csv').drop(['document_id', 'direction'], axis=1)
# Keep only categories 1 and 8, the two classes used in the rest of the notebook
text2 = text[(text['category_id'] == 1) | (text['category_id'] == 8)]

In [106]:
text[text['category_id'] == 1].head()

|  | category_id | news_id | text |
|---|---|---|---|
| 2022 | 1 | 1050208.0 | هواشناسی برای گلستان یخبندان و کاهش دما را پیش... |
| 2023 | 1 | 992.0 | آغاز دوباره بارش برف و باران در نوار شمالی کشو... |
| 2024 | 1 | 101394.0 | احتمال بارش برف و باران از اواخر وقت امروز در ... |
| 2025 | 1 | 250969.0 | کاهش نسبی دمای هوا در مازندران / بارش برف و با... |
| 2026 | 1 | 1953961.0 | باران و برف در آذربایجان غربی طی هفته آینده/ ت... |


In [68]:
text[text['category_id'] == 8].head()

|  | category_id | news_id | text |
|---|---|---|---|
| 942 | 8 | 490245.0 | موافقت اردن با مبادله یک تروریست با اسیر ژاپنی... |
| 943 | 8 | 1258198.0 | داعش یک مسجد تاریخی موصل را منفجر کرد \n به گز... |
| 944 | 8 | 648991.0 | عقب نشینی ناگهانی اعضای خارجی داعش از خیابان ه... |
| 945 | 8 | 864821.0 | جان بولتون: اوباما ده سال مقاومت غرب در برابر ... |
| 946 | 8 | 882.0 | کاخ سفید در سیاست خارجی خود جدی نیست و بسیار م... |


In [69]:
climate_con = text[text['category_id'] == 1].iloc[0:5]
politics = text[text['category_id'] == 8].iloc[0:5]

In [70]:
countvectorizer = CountVectorizer()

In [71]:
# Concatenate the five stories of each category into one long document.
# (Series.as_matrix() was removed in pandas 1.0; joining the Series directly works everywhere.)
cli = ' '.join(climate_con.text)
pol = ' '.join(politics.text)

In [72]:
content = [cli, pol]

In [73]:
countvectorizer.fit(content)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [74]:
doc_vec = countvectorizer.transform(content)

In [75]:
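# Rows are terms, column 0 is the climate document, column 1 the politics document.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+).
df = pd.DataFrame(doc_vec.toarray().transpose(), index=countvectorizer.get_feature_names_out())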

In [76]:
df.sort_values(0, ascending=False)

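| term | 0 (climate) | 1 (politics) |
|---|---|---|
| در | 57 | 40 |
| با | 31 | 16 |
| به | 21 | 41 |
| بارش | 18 | 0 |
| از | 18 | 35 |
| برف | 17 | 0 |
| باران | 15 | 0 |
| پیش | 14 | 3 |
| روز | 13 | 0 |
| استان | 13 | 2 |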


In [77]:
df.sort_values(1, ascending=False)

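| term | 0 (climate) | 1 (politics) |
|---|---|---|
| به | 21 | 41 |
| در | 57 | 40 |
| از | 18 | 35 |
| است | 8 | 26 |
| این | 4 | 26 |
| اوباما | 0 | 23 |
| را | 4 | 23 |
| که | 4 | 21 |
| می | 10 | 17 |
| های | 8 | 16 |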
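What `CountVectorizer` computes here can be sketched in a few lines of plain Python. This is only an approximation of its behaviour under the defaults shown in the repr above (lowercasing and the token pattern `(?u)\b\w\w+\b`); the helper names and the two toy documents are made up for illustration.

```python
import re
from collections import Counter

# Same token pattern that the CountVectorizer repr above reports:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_terms(document):
    """Lowercase the document and count each token."""
    return Counter(TOKEN_PATTERN.findall(document.lower()))


def term_document_matrix(documents):
    """Return (sorted vocabulary, one row of per-document counts per term)."""
    counts = [count_terms(doc) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, rows


# Toy documents (made up for illustration).
docs = ["Snow and rain, then more snow", "The rain stopped a vote"]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(term, row)
```

As in the tables above, each row is a term and each column a document. The single-character token "a" is dropped by the pattern, which is also why very short Persian tokens do not appear in the vocabulary.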


In [78]:
tfidf = TfidfVectorizer()

In [79]:
tfidf_vec = tfidf.fit_transform(content)

In [80]:
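# Same layout as before: rows are terms, columns are the climate and politics documents.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+).
df2 = pd.DataFrame(tfidf_vec.toarray().transpose(), index=tfidf.get_feature_names_out())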

In [81]:
df2.sort_values(0, ascending=False)

| term | 0 (climate) | 1 (politics) |
|---|---|---|
| در | 0.492603 | 0.344385 |
| با | 0.267907 | 0.137754 |
| بارش | 0.218633 | 0.000000 |
| برف | 0.206486 | 0.000000 |
| باران | 0.182194 | 0.000000 |
| به | 0.181485 | 0.352995 |
| روز | 0.157901 | 0.000000 |
| از | 0.155559 | 0.301337 |
| ابری | 0.133609 | 0.000000 |
| بینی | 0.121463 | 0.000000 |


In [82]:
df2.sort_values(1, ascending=False)

| term | 0 (climate) | 1 (politics) |
|---|---|---|
| به | 0.181485 | 0.352995 |
| در | 0.492603 | 0.344385 |
| از | 0.155559 | 0.301337 |
| اوباما | 0.000000 | 0.278312 |
| است | 0.069137 | 0.223850 |
| این | 0.034569 | 0.223850 |
| را | 0.034569 | 0.198021 |
| ایران | 0.000000 | 0.181508 |
| داعش | 0.000000 | 0.181508 |
| که | 0.034569 | 0.180802 |
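The TF-IDF weights above can be reproduced by hand. Below is a minimal sketch, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts); the function name `tfidf_columns` and the toy counts are made up for illustration. As a sanity check against the tables: در occurs in both documents (idf = 1) and بارش only in the first (idf = ln(3/2) + 1 ≈ 1.405), so the ratio of their weights in document 0 should be 18 × 1.405 / 57 ≈ 0.444, which agrees with 0.218633 / 0.492603.

```python
import math


def tfidf_columns(count_rows):
    """count_rows[t][d] is the count of term t in document d.

    Returns weights in the same layout, using
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
    and L2-normalising each document (column).
    """
    n_docs = len(count_rows[0])
    idf = [
        math.log((1 + n_docs) / (1 + sum(1 for c in row if c > 0))) + 1
        for row in count_rows
    ]
    weighted = [[c * w for c in row] for row, w in zip(count_rows, idf)]
    norms = [
        math.sqrt(sum(weighted[t][d] ** 2 for t in range(len(weighted))))
        for d in range(n_docs)
    ]
    return [
        [value / norms[d] if norms[d] else 0.0 for d, value in enumerate(row)]
        for row in weighted
    ]


# Toy counts (made up): rows are terms, columns are two documents.
counts = [
    [3, 2],  # term present in both documents -> idf = 1
    [1, 0],  # term present only in the first -> idf = ln(3/2) + 1
]
weights = tfidf_columns(counts)
for row in weights:
    print([round(v, 4) for v in row])
```

With only two documents the idf factor can take just two values, which is why frequent function words such as در still dominate both columns; removing stop words first would change that.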


In [83]:
tfidf.vocabulary_

{'12': 0,
 '15': 1,
 '1648': 2,
 '2003': 3,
 '2013': 4,
 '25': 5,
 '26': 6,
 '27': 7,
 '29': 8,
 '50': 9,
 '76': 10,
 '86003': 11,
 '89003': 12,
 '93': 13,
 'آبگرفتگی': 14,
 'آخرین': 15,
 'آذربایجان': 16,
 'آزاد': 17,
 'آزادی': 18,
 'آسمان': 19,
 'آشوب': 20,
 'آغاز': 21,
 'آلمان': 22,
 'آلود': 23,
 'آماده': 24,
 'آمار': 25,
 'آمریکا': 26,
 'آمریکایی': 27,
 'آن': 28,
 'آنجلس': 29,
 'آنچنانی': 30,
 'آنچه': 31,
 'آنگلا': 32,
 'آورد': 33,
 'آید': 34,
 'آینده': 35,
 'ابر': 36,
 'ابراز': 37,
 'ابری': 38,
 'احتمال': 39,
 'اخبار': 40,
 'اختلال': 41,
 'اخیر': 42,
 'ادامه': 43,
 'ادعای': 44,
 'اذعان': 45,
 'اراضی': 46,
 'ارتش': 47,
 'ارتفاعات': 48,
 'اردن': 49,
 'اردنی': 50,
 'ارزیابی': 51,
 'از': 52,
 'ازای': 53,
 'اساس': 54,
 'اساسی': 55,
 'اسبق': 56,
 'است': 57,
 'استان': 58,
 'استفاده': 59,
 'اسفند': 60,
 'اسلامی': 61,
 'اسیر': 62,
 'اش': 63,
 'اشاره': 64,
 'اشتباهات': 65,
 'اطمینان': 66,
 'اطمینانی': 67,
 'اظهار': 68,
 'اعتماد': 69,
 'اعضای': 70,
 'اعظم': 71,
 'اعلام': 72,
 'افراطی': 73,
 '

### Stemming

In [84]:
from nltk.stem import SnowballStemmer


In [85]:
stemmer = SnowballStemmer('english')

In [86]:
stemmer.stem("impressive")

'impress'

In [87]:
stemmer.stem("impressness")

'impress'

In [88]:
from hazm import Stemmer

In [89]:
stem2 = Stemmer()

In [90]:
stem2.stem('کتاب ها')

'کتاب '

In [91]:
stem2.stem('کتاب‌هایش')

'کتاب'

In [92]:
stem2.stem('کتاب هایم')

'کتاب '
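To see what a stemmer does in principle, here is a deliberately naive suffix stripper. It is not the Snowball algorithm and not hazm's implementation, only a toy sketch: the suffix list and the `min_stem` threshold are assumptions picked for this illustration.

```python
# Hypothetical suffix list chosen for this illustration only.
SUFFIXES = ("ness", "ive", "ing", "ed", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ("impressive", "impressness", "books", "is"):
    print(w, "->", strip_suffix(w))
```

Real stemmers apply ordered groups of rules with extra conditions on the remaining stem, which is what keeps them from mangling short words the way a flat suffix list easily does.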

## Naive Bayes

> Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.

Bayes' rule:

$$ P(Y|X) = \frac{P(Y)P(X|Y)}{P(X)} $$

combined with the naive assumption:

$$ P(x_i | y,x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_n) = P(x_i|y) $$

leads to:

$$ P(y| x_1,x_2,\dots,x_n) = \frac{P(y) \prod_{i=1}^n P(x_i|y)}{P(x_1,x_2,\dots,x_n)} $$

The denominator does not depend on the class, so if the purpose is only to classify it can be dropped:

$$ \hat{y} = \arg\max_y P(y)\prod_{i=1}^n P(x_i|y) $$
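The decision rule can be sketched directly in plain Python for the Gaussian case, where each P(x_i|y) is modelled as a normal density. The sketch works with logarithms, since a product of many small probabilities underflows quickly. The function names, the fixed `var_floor` (scikit-learn's `GaussianNB` uses a different smoothing scheme), and the toy data are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate log P(y) and a per-class, per-feature mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Toy data (made up): two features, two well-separated classes.
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [5.0, 7.9], [5.2, 8.1], [4.8, 8.0]]
y_toy = [1, 1, 1, 8, 8, 8]
toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [1.1, 2.0]))  # 1
print(predict_gaussian_nb(toy_model, [5.1, 8.0]))  # 8
```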

In [107]:
from sklearn.naive_bayes import GaussianNB

In [108]:
model = GaussianNB()

In [113]:
model.fit(tfidf.transform(text2.text).toarray(), text2.category_id)

GaussianNB(priors=None)

In [123]:
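# Per-class variance of each feature: 2 classes x 700 terms.
# The attribute was called sigma_ before scikit-learn 1.0; sigma_ was removed in 1.2.
model.var_.shape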

(2, 700)

In [101]:
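from sklearn.metrics import classification_report
# Note: predictions are made on the same rows the model was fitted on, so these are training scores
print(classification_report(text2.category_id, model.predict(tfidf.transform(text2.text).toarray())))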

             precision    recall  f1-score   support

          1       0.74      1.00      0.85        74
          8       1.00      0.95      0.97       483

avg / total       0.97      0.95      0.96       557
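The figures in the report follow from simple counts. The confusion counts below are inferred from the rounded report (74 stories of class 1 all predicted correctly, 26 stories of class 8 predicted as class 1), so treat them as an assumption rather than notebook output; the helper `per_class_metrics` is likewise made up for illustration.

```python
# Counts inferred from the rounded report above (an assumption, not notebook output).
# Outer keys are true classes, inner keys are predicted classes.
confusion = {1: {1: 74, 8: 0}, 8: {1: 26, 8: 457}}


def per_class_metrics(confusion, label):
    """Return (precision, recall, f1, support) for one class."""
    tp = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())
    support = sum(confusion[label].values())
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support


for label in confusion:
    p, r, f, s = per_class_metrics(confusion, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

Precision for class 1 is 74 / (74 + 26) = 0.74, and recall for class 8 is 457 / 483 ≈ 0.95, matching the two values in the report that are below 1.00.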

