# Table of Contents (need to re-edit it)
- [Introduction](#Introduction)
- [Project Objective](#Project%20Objective)
- [Process Summary](#Process%20Summary)
- [EDA](#EDA)
- [Clean the example review texts](#clean-example-texts)
- [Learn to prepare texts for modeling](#learn-to-prepare-text)
- [Clean & prepare an entire column of texts](#clean-prepare-entire-text-column)
- [LDA analysis](#lda)

<a id='Introduction'></a>
# Introduction

<a id='Project Objective'></a>
# Project Objective

<a id='symbols'></a>
# Meanings of some symbols

In [1]:
import numpy as np
import pandas as pd
from numpy import nan as NA  # the `NaN` alias was removed in NumPy 2.0; `nan` works in all versions
import numpy.random as random
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.weightstats import ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('ticks')


In [2]:
np.random.seed(30)
random_state = 30

In [47]:
# Load the training set and the scoring set
insur_t = pd.read_csv('bzan6357_insurance_3_TRAINING.csv')
insur_s = pd.read_csv('bzan6357_insurance_3_SCORE.csv')

In [48]:
insur_t.dtypes

id_new              object
buy                  int64
age                  int64
gender              object
tenure               int64
region               int64
dl                   int64
has_v_insurance      int64
v_age               object
v_accident          object
v_prem_quote       float64
cs_rep               int64
dtype: object

In [49]:
# Recode the two binary text columns as 0/1 flags (1 = 'yes', 1 = 'female')
insur_t['v_accident'] = (insur_t['v_accident'] == 'yes').astype(int)
insur_t['gender'] = (insur_t['gender'] == 'female').astype(int)

In [50]:
insur_t['v_age'] = insur_t['v_age'].astype('category')

In [51]:
insur_t.dtypes

id_new               object
buy                   int64
age                   int64
gender                int64
tenure                int64
region                int64
dl                    int64
has_v_insurance       int64
v_age              category
v_accident            int64
v_prem_quote        float64
cs_rep                int64
dtype: object

In [63]:
insur_t.head(10)

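| | id_new | buy | age | gender | tenure | region | dl | has_v_insurance | v_age | v_accident | v_prem_quote | cs_rep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a00000000 | 0 | 34 | 1 | 31 | 19 | 1 | 0 | 1-2 year | 1 | 27715.0 | 154 |
| 1 | a00000001 | 0 | 50 | 0 | 211 | 34 | 1 | 0 | 1-2 year | 1 | 33945.0 | 154 |
| 2 | a00000002 | 0 | 42 | 1 | 122 | 29 | 1 | 0 | 1-2 year | 1 | 37577.0 | 163 |
| 3 | a00000003 | 0 | 28 | 0 | 75 | 3 | 1 | 0 | 1-2 year | 1 | 2630.0 | 154 |
| 4 | a00000004 | 0 | 75 | 1 | 19 | 28 | 1 | 0 | 1-2 year | 1 | 47511.0 | 122 |
| 5 | a00000005 | 0 | 25 | 0 | 55 | 45 | 1 | 1 | < 1 year | 0 | 32423.0 | 151 |
| 6 | a00000006 | 0 | 34 | 0 | 246 | 28 | 1 | 1 | 1-2 year | 0 | 59404.0 | 122 |
| 7 | a00000007 | 0 | 76 | 0 | 148 | 28 | 1 | 0 | 1-2 year | 1 | 48910.0 | 122 |
| 8 | a00000008 | 0 | 42 | 0 | 292 | 28 | 1 | 0 | 1-2 year | 1 | 29834.0 | 122 |
| 9 | a00000009 | 1 | 50 | 0 | 127 | 28 | 1 | 0 | > 2 years | 1 | 56986.0 | 122 |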


In [53]:
n_labels = insur_t['v_age'].nunique()
labels = np.sort(insur_t['v_age'].unique())
print(f'There are {n_labels} discrete labels in "v_age"; they are: {labels}')
print(f'Thus, {n_labels - 1} flag variables are needed')

There are 3 discrete labels in "v_age"; they are: ['1-2 year' '< 1 year' '> 2 years']
Thus, 2 flag variables are needed


In [59]:
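# One-hot encode v_age; drop_first=True keeps k-1 flags ('1-2 year' becomes the baseline)
flags_v_age = pd.get_dummies(insur_t['v_age'], prefix='flag_v_age', drop_first=True)
flags_v_age['id_new'] = insur_t['id_new'].copy()
# validate='1:1' raises an error if id_new is duplicated on either side
insur_t1 = pd.merge(left=insur_t, right=flags_v_age, how='inner', on='id_new', validate='1:1')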



In [62]:
# Note: a notebook cell displays only its last expression, so the first preview is not shown
insur_t[['id_new', 'v_age']].head(3)
insur_t1[['id_new'] + [col for col in insur_t1.columns if 'flag' in col]].head(10)

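| | id_new | flag_v_age_< 1 year | flag_v_age_> 2 years |
|---|---|---|---|
| 0 | a00000000 | 0 | 0 |
| 1 | a00000001 | 0 | 0 |
| 2 | a00000002 | 0 | 0 |
| 3 | a00000003 | 0 | 0 |
| 4 | a00000004 | 0 | 0 |
| 5 | a00000005 | 1 | 0 |
| 6 | a00000006 | 0 | 0 |
| 7 | a00000007 | 0 | 0 |
| 8 | a00000008 | 0 | 0 |
| 9 | a00000009 | 0 | 1 |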


In [64]:
insur_t1.dtypes

id_new                    object
buy                        int64
age                        int64
gender                     int64
tenure                     int64
region                     int64
dl                         int64
has_v_insurance            int64
v_age                   category
v_accident                 int64
v_prem_quote             float64
cs_rep                     int64
flag_v_age_< 1 year        uint8
flag_v_age_> 2 years       uint8
dtype: object

<a id='EDA'></a>
# EDA

<a id='clean-example-texts'></a>
# Clean the example review texts

<a id='learn-to-prepare-text'></a>
# Learn to prepare texts for modeling

## Import packages

## Demo individual functions

#### Caution! The output of `simple_tokenize()` is a generator, not a list or an array!
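
A minimal sketch of why this matters, assuming `simple_tokenize` here is the one from `gensim.utils`, which yields tokens lazily. The result can be consumed only once, so it is usually materialized with `list()` before further use. If gensim is not installed, the block falls back to a hypothetical regex-based stand-in that only mimics the lazy behaviour; the sample sentence is made up for illustration.

```python
import re
import types

try:
    from gensim.utils import simple_tokenize
except ImportError:
    # Hypothetical stand-in (assumption): yields runs of letters lazily,
    # only to mimic the generator behaviour when gensim is unavailable.
    def simple_tokenize(text):
        for match in re.finditer(r"[^\W\d_]+", text):
            yield match.group()

text = "The pizza was great"  # made-up example sentence

tokens = simple_tokenize(text)
is_generator = isinstance(tokens, types.GeneratorType)

first_pass = list(tokens)   # consumes the generator
second_pass = list(tokens)  # the generator is now exhausted

print(is_generator)
print(first_pass)
print(second_pass)
```

Wrapping the call as `list(simple_tokenize(text))` right away avoids the silent empty result on the second pass.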

### 3. remove terms of extreme lengths
Terms shorter than 3 characters or longer than 15 characters are usually of little use in text mining, unless the documents you work with rely on a highly specialized vocabulary.

E.g., short terms such as "as" and "is" appear too often in most documents to tell them apart, while long terms such as "Buckminsterfullerene" (a form of carbon) may appear in only one document in the entire corpus. Neither kind contributes much in a typical text-mining project.

>In fact, many of the extremely short words ("as", "is", etc.) are known as **stop-words**.
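
The length rule above can be sketched as a simple filter. The helper name `filter_by_length`, the constants `MIN_LEN` and `MAX_LEN`, and the sample tokens are assumptions made for illustration; the thresholds follow the 3 and 15 character limits mentioned above and may need tuning for a specialized corpus.

```python
MIN_LEN = 3   # shortest term to keep (inclusive)
MAX_LEN = 15  # longest term to keep (inclusive)

def filter_by_length(tokens, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Made-up tokens: two very short, two ordinary, one very long
tokens = ["as", "is", "pizza", "delivery", "buckminsterfullerene"]
kept = filter_by_length(tokens)
print(kept)
```

Keeping the bounds as parameters makes it easy to loosen them for documents with long technical terms.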

### 4. remove stop-words
>`STOPWORDS` from `gensim` (in `gensim.parsing.preprocessing`) is a frozenset of common English stop-words, all in lower case, so lower-case your tokens before comparing against it
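
A minimal sketch of stop-word removal using that set. The helper `drop_stopwords` and the sample tokens are hypothetical; if gensim is not installed, the block falls back to a tiny hand-written sample set, which is an assumption and not gensim's actual list.

```python
try:
    from gensim.parsing.preprocessing import STOPWORDS
except ImportError:
    # Fallback (assumption): a tiny illustrative sample, not gensim's full list
    STOPWORDS = frozenset({"the", "is", "and", "as", "a", "of"})

def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stop-words (case-insensitive match)."""
    # STOPWORDS is lower-case, so compare the lower-cased token
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "pizza", "is", "great", "and", "cheap"]  # made-up tokens
content_tokens = drop_stopwords(tokens)
print(content_tokens)
```

Because membership tests on a set are constant time on average, this stays fast even on long token lists.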