# 1. What Is Text Mining?

Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. You can use text mining to analyze vast collections of textual materials to capture key concepts, trends and hidden relationships.
Link: https://www.ibm.com/topics/text-mining
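
As a minimal sketch of what "transforming unstructured text into a structured format" can look like, the snippet below turns two short strings into word-count rows over a shared vocabulary. The sample sentences and the helper name `bag_of_words` are illustrative assumptions, not taken from the cited source.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, rows); each row holds one document's word counts."""
    tokenized = [doc.lower().split() for doc in docs]
    # Shared, sorted vocabulary across all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical sample documents.
docs = ["text mining finds patterns", "mining text finds insights in text"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is now a fixed-length numeric vector, which is the kind of structured input that later pattern-mining steps expect.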

# 2. Processing

## 1. Preprocessing
## 2. Document Representation
## 3. Association Rule Mining
## 4. Rule Generation
## 5. Rule Evaluation and Selection
## 6. Interpretation and Visualization
## 7. Application

Link: https://arxiv.org/ftp/arxiv/papers/1009/1009.4582.pdf

# 3. Demo

## 1. Read Data

In [248]:
list_word = []

In [249]:
with open('data.txt','r') as f:
    info_temp = f.read()

In [250]:
# split() with no argument splits on any whitespace and drops empty strings
list_word = info_temp.split()

In [251]:
print(list_word)

['The', 'Ho', 'Chi', 'Minh', 'City', 'University', 'of', 'Technology', 'and', 'Education', 'HCMUTE', 'stands', 'beacon', 'of', 'academic', 'excellence', 'in', 'Vietnam', 'Established', 'in', '1957', 'the', 'Ho', 'Chi', 'Minh', 'City', 'Pedagogical', 'University', 'its', 'transformation', 'into', 'HCMUTE', 'in', '2006', 'marked', 'significant', 'milestone', 'in', 'its', 'journey', 'Situated', 'in', 'the', 'bustling', 'metropolis', 'of', 'Ho', 'Chi', 'Minh', 'City', 'the', "university's", 'campus', 'in', 'Thu', 'Duc', 'District', 'serves', 'vibrant', 'hub', 'for', 'learning', 'and', 'innovation', 'Offering', 'diverse', 'array', 'of', 'undergraduate', 'and', 'graduate', 'programs', 'HCMUTE', 'specializes', 'in', 'engineering', 'technology', 'and', 'education', 'From', 'mechanical', 'engineering', 'to', 'computer', 'science', 'civil', 'engineering', 'to', 'architecture', 'and', 'education', 'management', 'the', 'university', 'caters', 'to', 'wide', 'spectrum', 'of', 'academic', 'interests'

## 2. Creating Itemsets

Use combinations (in the mathematical sense) to enumerate every itemset of 1 to 4 words.

In [252]:
from itertools import combinations 

In [253]:
# Build every combination of 1 to 4 words; each tuple is treated as one transaction.
transaction = []
for i in range(1, 5):
    transaction.extend(combinations(list_word, i))

In [254]:
print(transaction[800:1000])

[('University', 'in'), ('University', 'the'), ('University', 'bustling'), ('University', 'metropolis'), ('University', 'of'), ('University', 'Ho'), ('University', 'Chi'), ('University', 'Minh'), ('University', 'City'), ('University', 'the'), ('University', "university's"), ('University', 'campus'), ('University', 'in'), ('University', 'Thu'), ('University', 'Duc'), ('University', 'District'), ('University', 'serves'), ('University', 'vibrant'), ('University', 'hub'), ('University', 'for'), ('University', 'learning'), ('University', 'and'), ('University', 'innovation'), ('University', 'Offering'), ('University', 'diverse'), ('University', 'array'), ('University', 'of'), ('University', 'undergraduate'), ('University', 'and'), ('University', 'graduate'), ('University', 'programs'), ('University', 'HCMUTE'), ('University', 'specializes'), ('University', 'in'), ('University', 'engineering'), ('University', 'technology'), ('University', 'and'), ('University', 'education'), ('University', 'Fr

In [255]:
len(transaction)

11725155
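
The total above can be checked with binomial coefficients. The sketch below assumes the word list holds 130 tokens; that length is not printed in the notebook and is inferred here only because it reproduces the reported total.

```python
from math import comb

# Assumption: 130 tokens, inferred from the reported total rather than
# read from the notebook output.
n_tokens = 130

# Number of combinations for each itemset size 1..4.
per_size = {k: comb(n_tokens, k) for k in range(1, 5)}
total = sum(per_size.values())

print(per_size)
print(total)
```

Almost all of the transactions are 4-word combinations, so the list grows very quickly with the length of the input text.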

In [256]:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
te = TransactionEncoder()
te_ary = te.fit(transaction).transform(transaction)
dataset = pd.DataFrame(te_ary, columns=te.columns_)
dataset.head()

| | 1957 | 2006 | Chi | City | District | Duc | Education | Established | From | HCMUTE | ... | technology | the | to | transformation | undergraduate | university | university's | vibrant | wide | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


In [257]:
dataset.shape

(11725155, 86)
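
To make the encoding step concrete, here is a rough pure-Python equivalent of a one-hot transaction encoding: one boolean column per distinct item, in sorted order, and one row per transaction. It is a sketch of the idea only, not mlxtend's implementation; `one_hot_encode` and the sample transactions are hypothetical.

```python
def one_hot_encode(transactions):
    """Return (columns, rows) with one boolean per item per transaction."""
    # One column per distinct item, sorted for a stable column order.
    columns = sorted({item for t in transactions for item in t})
    # A cell is True when the transaction contains that column's item.
    rows = [[col in t for col in columns] for t in transactions]
    return columns, rows


# Hypothetical transactions in the same tuple form as the notebook's.
transactions = [("Ho",), ("Chi",), ("Ho", "Chi"), ("Ho", "Minh")]
columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)
```

Note that a repeated word inside one tuple collapses into a single `True` cell, so the encoded table records presence, not counts.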

In [258]:
# Keep only the first 10,000 transactions to make Apriori tractable
dataset_2 = dataset.iloc[:10000]

In [259]:
dataset_2.shape

(10000, 86)

## 3. Using Apriori in mlxtend

In [260]:
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(dataset_2, min_support=0.001, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.head()

| | support | itemsets | length |
|---|---|---|---|
| 0 | 0.0143 | (1957) | 1 |
| 1 | 0.0142 | (2006) | 1 |
| 2 | 0.0538 | (Chi) | 1 |
| 3 | 0.0538 | (City) | 1 |
| 4 | 0.0142 | (District) | 1 |


In [261]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.003) ]

| | support | itemsets | length |
|---|---|---|---|
| 92 | 0.0154 | (The, Chi) | 2 |
| 94 | 0.0034 | (and, Chi) | 2 |
| 103 | 0.0154 | (The, City) | 2 |
| 105 | 0.0034 | (and, City) | 2 |
| 113 | 0.0129 | (The, Education) | 2 |
| 122 | 0.0153 | (The, HCMUTE) | 2 |
| 124 | 0.0034 | (and, HCMUTE) | 2 |
| 132 | 0.0154 | (Ho, The) | 2 |
| 134 | 0.0034 | (and, Ho) | 2 |
| 140 | 0.0154 | (The, Minh) | 2 |
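
The support of an itemset is the fraction of transactions that contain every item in it. The brute-force sketch below computes that directly for all pairs in a tiny, hypothetical transaction list; it illustrates the quantity being thresholded, not the pruning strategy that Apriori uses to avoid checking every candidate.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical transactions, not the notebook's data.
transactions = [("Ho", "Chi"), ("Ho", "Chi", "Minh"), ("Chi", "Minh"), ("City",)]
items = sorted({item for t in transactions for item in t})

min_support = 0.5
frequent_pairs = {}
for pair in combinations(items, 2):
    s = support(pair, transactions)
    if s >= min_support:
        frequent_pairs[pair] = s

print(frequent_pairs)
```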


## 4. Mining Association Rules

In [262]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.2)
rules["antecedents_length"] = rules["antecedents"].apply(len)
rules["consequents_length"] = rules["consequents"].apply(len)
rules.sort_values("lift", ascending=False)[
    ["antecedents", "consequents", "antecedent support",
     "consequent support", "support", "confidence", "lift"]
].head(5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift |
|---|---|---|---|---|---|---|---|
| 252 | (stands) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |
| 253 | (The) | (stands) | 0.1615 | 0.0258 | 0.0129 | 0.079876 | 3.095975 |
| 132 | (The) | (Technology) | 0.1615 | 0.0258 | 0.0129 | 0.079876 | 3.095975 |
| 160 | (The) | (beacon) | 0.1615 | 0.0258 | 0.0129 | 0.079876 | 3.095975 |
| 133 | (Technology) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |


In [263]:
rules.sort_values("confidence", ascending=False)[
    ["antecedents", "consequents", "antecedent support",
     "consequent support", "support", "confidence", "lift"]
].head(5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift |
|---|---|---|---|---|---|---|---|
| 161 | (beacon) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |
| 55 | (Education) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |
| 133 | (Technology) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |
| 252 | (stands) | (The) | 0.0258 | 0.1615 | 0.0129 | 0.5 | 3.095975 |
| 145 | (University) | (The) | 0.0399 | 0.1615 | 0.0142 | 0.35589 | 2.203652 |


In [264]:
rules.shape

(408, 12)
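
The confidence and lift values in the tables can be reproduced by hand from the support columns: confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). The sketch below plugs in the numbers reported for the `stands` and `The` rules; the helper functions are illustrative and not part of mlxtend.

```python
def confidence(support_ab, support_a):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    return support_ab / support_a


def lift(support_ab, support_a, support_b):
    """lift(A -> B) = support(A and B) / (support(A) * support(B))."""
    return support_ab / (support_a * support_b)


# Supports copied from the rule tables above.
s_stands, s_the, s_both = 0.0258, 0.1615, 0.0129

conf_stands_the = confidence(s_both, s_stands)  # (stands) -> (The)
conf_the_stands = confidence(s_both, s_the)     # (The) -> (stands)
lift_either_way = lift(s_both, s_stands, s_the)

print(round(conf_stands_the, 6))
print(round(conf_the_stands, 6))
print(round(lift_either_way, 6))
```

Lift is symmetric in A and B, which is why both directions of a rule share the same lift while their confidences differ.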

## 5. Application

In [265]:
rules_a = rules.sort_values("confidence", ascending=False)[
    ["antecedents", "consequents", "antecedent support",
     "consequent support", "support", "confidence", "lift"]
]

In [266]:
rules_a[["antecedents","consequents"]].head()

| | antecedents | consequents |
|---|---|---|
| 161 | (beacon) | (The) |
| 55 | (Education) | (The) |
| 133 | (Technology) | (The) |
| 252 | (stands) | (The) |
| 145 | (University) | (The) |


In [267]:
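rules_list = []
for row in rules.itertuples():
    # antecedents/consequents are frozensets of words. Join their items
    # directly; str.strip() removes characters, not a prefix, so parsing
    # the repr only works by accident.
    x = ", ".join(sorted(row.antecedents))
    y = ", ".join(sorted(row.consequents))
    rules_list.append([x, y])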

In [268]:
print(rules_list)

[['1957', 'The'], ['The', '1957'], ['The', '2006'], ['2006', 'The'], ['City', 'Chi'], ['Chi', 'City'], ['HCMUTE', 'Chi'], ['Chi', 'HCMUTE'], ['Ho', 'Chi'], ['Chi', 'Ho'], ['Minh', 'Chi'], ['Chi', 'Minh'], ['The', 'Chi'], ['Chi', 'The'], ['University', 'Chi'], ['Chi', 'University'], ['and', 'Chi'], ['Chi', 'and'], ['engineering', 'Chi'], ['Chi', 'engineering'], ['in', 'Chi'], ['Chi', 'in'], ['of', 'Chi'], ['Chi', 'of'], ['Chi', 'the'], ['the', 'Chi'], ['Chi', 'to'], ['to', 'Chi'], ['City', 'HCMUTE'], ['HCMUTE', 'City'], ['Ho', 'City'], ['City', 'Ho'], ['City', 'Minh'], ['Minh', 'City'], ['The', 'City'], ['City', 'The'], ['City', 'University'], ['University', 'City'], ['and', 'City'], ['City', 'and'], ['City', 'engineering'], ['engineering', 'City'], ['in', 'City'], ['City', 'in'], ['of', 'City'], ['City', 'of'], ['City', 'the'], ['the', 'City'], ['City', 'to'], ['to', 'City'], ['District', 'The'], ['The', 'District'], ['Duc', 'The'], ['The', 'Duc'], ['The', 'Education'], ['Education', '

In [303]:
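text = "Ho"   # seed word; the generated phrase accumulates here
dk = text     # current antecedent to look up
dem = 0       # index in rules_list where the next scan starts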

In [304]:
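# Extend the phrase by up to three words, scanning rules_list forward only.
for d in range(0, 3):
    t = 0  # rules skipped in this pass before a match
    for i in range(dem, len(rules_list)):
        # Compare against whole words: a substring test on text would treat
        # 'in' as already used once 'Minh' is part of the phrase.
        if rules_list[i][0] == dk and rules_list[i][1] not in text.split():
            dk = rules_list[i][1]
            print(dk)
            text += " " + rules_list[i][1]
            break
        t += 1
    dem += t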

Chi
Minh
City


In [305]:
print(text)

Ho Chi Minh City
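
The chaining loop above can be wrapped in a reusable function. This is a sketch under the same forward-only scanning strategy; the name `extend_phrase` and its `steps` parameter are hypothetical, and the sample rules are the first few single-word pairs visible in the printed `rules_list`.

```python
def extend_phrase(rules_list, start, steps=3):
    """Greedily chain rules: follow the next rule whose antecedent is the
    current word and whose consequent has not been used yet."""
    words = [start]
    current = start
    position = 0  # scanning never moves backwards through rules_list
    for _ in range(steps):
        for i in range(position, len(rules_list)):
            antecedent, consequent = rules_list[i]
            if antecedent == current and consequent not in words:
                words.append(consequent)
                current = consequent
                position = i
                break
        else:
            # No rule matched from this position onward, so stop early.
            break
    return " ".join(words)


# A small excerpt of [antecedent, consequent] pairs in their printed order.
sample_rules = [
    ["Ho", "Chi"], ["Chi", "Ho"], ["Minh", "Chi"],
    ["Chi", "Minh"], ["City", "Minh"], ["Minh", "City"],
]
phrase = extend_phrase(sample_rules, "Ho")
print(phrase)
```

Because the scan only moves forward and takes the first match, the result depends on the order of the rules rather than on their confidence or lift; sorting the rules by a metric first would be one way to change that.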
