# Text Quality Analyzer
by Ricky | v1(20/04/2020)

## Observations

Interpreted as wrong words:
- URLs
- alphanumeric codes (e.g. dsa543nqhf7dh3ndsaku43878)

Interpreted as correct words:
- feminine forms
- plural forms
- verb conjugations
- words with symbols or punctuation at the beginning or end (e.g. "example:")

Omitted:
- All expressions that contain only numbers and symbols (e.g. 19/04/2020, 12.345.678-9)
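
These observations follow from two steps in the analyzer: each space-separated token is stripped of the characters in `NOT_ALLOWED`, and tokens without any alphabetic character are skipped. A minimal sketch of that filtering, leaving out the dictionary lookup (the helper name `countable_tokens` is hypothetical and not part of the notebook):

```python
# Same characters the notebook strips from both ends of each token.
NOT_ALLOWED = ' \n.,-:;?¿\'"'

def countable_tokens(text):
    """Return the tokens that would be checked against the dictionary."""
    tokens = (token.strip(NOT_ALLOWED) for token in text.split(' '))
    # Tokens made only of digits and symbols contain no letter and are skipped.
    return [token for token in tokens if any(ch.isalpha() for ch in token)]

print(countable_tokens('example: 19/04/2020 12.345.678-9 dsa543nqhf7dh3ndsaku43878'))
```

The date and the ID number are dropped before any lookup, while the alphanumeric code survives and would then be counted as a wrong word.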


## Possible Improvements

- Save all the wrong words to analyze whether they are really wrong or are technical terms, abbreviations, or company-specific vocabulary.

- Detect the language of the text, so that a text written in another language does not have all of its words flagged as wrong.
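
As a rough illustration of the first improvement, the wrong words could be tallied across all texts so that frequent ones can be reviewed by hand. This is only a sketch under assumptions: `collect_wrong_words` and the toy `known` set are hypothetical, and the tokenization mirrors the one used by the analyzer.

```python
from collections import Counter

NOT_ALLOWED = ' \n.,-:;?¿\'"'

def collect_wrong_words(texts, known_words):
    """Count every token that has a letter but is missing from known_words."""
    counts = Counter()
    for text in texts:
        for token in text.split(' '):
            word = token.strip(NOT_ALLOWED).lower()
            if any(ch.isalpha() for ch in word) and word not in known_words:
                counts[word] += 1
    return counts

known = {'the', 'router', 'is', 'down'}  # toy dictionary, for illustration only
wrong = collect_wrong_words(['The router is down.', 'xfinity router down'], known)
print(wrong.most_common(5))
```

Words that appear often in this tally are good candidates for a supplementary word list of brand names or internal terms.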

## Initial Load

### Imports

In [1]:
from os import listdir
import pandas as pd
import boto3

from io import BytesIO
from functools import reduce
from os.path import join as pjoin
from datetime import datetime as dt

### Variables

In [2]:
# Symbols to strip from the beginning and end of each word:
NOT_ALLOWED = ' \n.,-:;?¿\'"'
NOT_ALLOWED_REGEX = r'\n\.,\-:;?¿\'"'

DIRECTORY_DICTS = 'languages_dictionaries'
SPANISH_DICT = 'spanish.0'
ENGLISH_DICT = 'english.0'

# Bucket variables:
bucket = "kranio-datalake"
key = "internal/webinar/leoCamilo/raw/ratings.csv"
key2 = "internal/webinar/leoCamilo/raw/complaints.csv"

AWS_REGION = 'us-east-2'

## Exploratory Analysis

### Helpers

In [3]:
def file_to_words_list(file_path):
    '''
    Receives the path of a file. Each line of the file contains one word.
    Return a list with the words.
    '''
    with open(file_path, 'r', encoding='utf-8') as file_obj:
        return words_list_from(file_obj.readlines())

def words_list_from(file_obj):
    return list(map(lambda word: word.strip().lower(), file_obj))

def files_to_lists(directory, *files):
    '''
    Receives a directory. If file names are given, those will be processed.
    Otherwise all the files in the directory will be processed.
    Return a list of lists with words.
    '''
    if files: lst_files = files # specified files are processed
    else: lst_files = listdir(directory) # all files in directory are processed
    return list(map(lambda x: file_to_words_list(pjoin(directory, x)), lst_files))

In [4]:
def has_alpha_char(word):
    '''
    Returns True if word contains at least one alphabetic character. False otherwise.
    '''
    # any() also handles the empty string, returning False instead of None.
    return any(char.isalpha() for char in word)

class TextMetrics():
    def __init__(self, total_words, wrong_words, quality_score):
        self.total_words = total_words
        self.wrong_words = wrong_words
        self.quality_score = quality_score

class TextQualityAnalyzer:
    def __init__(self, *words_lists):
        self.idiom_words = reduce(lambda lst1, lst2: lst1 + lst2, words_lists)
        self.text = ''
        self.__iter_words = None
    
    def __parse_text(self):
        self.__iter_words = map(lambda word: word.strip(NOT_ALLOWED), self.text.split(' '))

    def word_exist(self, word):
        '''
        Returns True if the word exists in the word list. False otherwise.
        '''
        word = word.lower()
        return word in self.idiom_words
        
    def __check_words_existance(self):
        wrong_words = 0
        total_words = 0
        for word in self.__iter_words:
            if has_alpha_char(word):
                wrong_words += int(not self.word_exist(word))
                total_words += 1
        return total_words, wrong_words
            
    def generate_metrics_obj(self, text):
        '''
        Receives a text and returns a TextMetrics object with its word counts and an
        integer quality score computed as total_words // (wrong_words + 1).
        '''
        self.text = text
        self.__parse_text()
        total_words, wrong_words = self.__check_words_existance()
        quality_score = total_words // (wrong_words + 1)
        return TextMetrics(total_words, wrong_words, quality_score)


### TextQualityAnalyzer Object Creation

In [5]:
lst_words = file_to_words_list(pjoin(DIRECTORY_DICTS, ENGLISH_DICT)) # Change the dictionary here to switch language
tqa = TextQualityAnalyzer(lst_words) # Additional word lists can be passed as extra arguments

In [6]:
lst_words

['W',
 'w',
 'WW',
 'WWW',
 'WY',
 'Y',
 'y',
 'A',
 'a',
 'AA',
 'AAA',
 'aah',
 'ah',
 'AI',
 'air',
 'AR',
 'Ar',
 'Au',
 'aw',
 'E',
 'e',
 'ea',
 'ear',
 'EEO',
 "e'er",
 'eh',
 'EOE',
 'ER',
 'Er',
 'EU',
 'Eu',
 'Eur',
 'I',
 'i',
 'IA',
 'Ia',
 'IE',
 'ii',
 'iii',
 'Io',
 'IOU',
 'Ir',
 'O',
 'o',
 'oar',
 'OE',
 "o'er",
 'OH',
 'oh',
 'oi',
 'ooh',
 'OR',
 'or',
 'our',
 'ow',
 'U',
 'u',
 'UAR',
 'UAW',
 'ugh',
 'uh',
 'Ur',
 'aether',
 'Arther',
 'Arthur',
 'Aurthur',
 'author',
 'earth',
 'Eartha',
 'earthier',
 'earthy',
 'eighth',
 'either',
 'Ertha',
 'Ethe',
 'ether',
 'oath',
 'Otha',
 'other',
 'Otho',
 'earthbound',
 'Athabasca',
 'Athabaskan',
 "Athabascan's",
 "Athabaskan's",
 'Athabaskans',
 "Athabasca's",
 "Athabaska's",
 'orthorhombic',
 'ethic',
 'Ithaca',
 'earthquake',
 'earthquaking',
 "earthquake's",
 'earthquakes',
 'earthquaked',
 'ethical',
 'ethically',
 'ethicals',
 'Ithacan',
 'orthogonal',
 'orthogonally',
 'orthogonality',
 "orthogonality's",
 'ort

### Analysis of a Single Text

In [7]:
tmo = tqa.generate_metrics_obj('my personal number: 12.345.678-9 fsafdas')
tmo.quality_score

2

In [8]:
tmo.total_words

4

In [9]:
tmo.wrong_words

1
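
The three values above are consistent with the scoring rule in `generate_metrics_obj`: `12.345.678-9` is omitted because it has no letters, the remaining four tokens are counted, and only `fsafdas` is missing from the dictionary. A standalone sketch of the arithmetic (the function name `quality_score` is hypothetical; the formula is the one used in the class):

```python
def quality_score(total_words, wrong_words):
    """Integer score used by the notebook: total_words // (wrong_words + 1)."""
    return total_words // (wrong_words + 1)

# The single-text example: 4 counted words, 1 wrong word.
print(quality_score(4, 1))
```

Because the score is an integer quotient rather than a ratio, it grows with text length: a long text can score higher than a short one even when it contains more wrong words.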

### Analysis of Multiple Texts in a DataFrame

In [10]:
s3 = boto3.client('s3')
comprehend_client = boto3.client('comprehend', region_name=AWS_REGION)
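
A Comprehend client like the one above could serve the language-detection improvement listed earlier. The sketch below is an assumption about how that might look, not something the notebook does: `dominant_language` is a hypothetical helper built on Comprehend's `detect_dominant_language` call, and `FakeComprehend` is a stand-in that imitates the response shape so the example runs without AWS access.

```python
def dominant_language(client, text):
    """Return the most likely language code for text, or None if none is reported."""
    response = client.detect_dominant_language(Text=text)
    languages = response.get('Languages', [])
    if not languages:
        return None
    return max(languages, key=lambda lang: lang['Score'])['LanguageCode']

class FakeComprehend:
    """Stand-in with the same response shape; replace with a real boto3 client."""
    def detect_dominant_language(self, Text):
        return {'Languages': [{'LanguageCode': 'en', 'Score': 0.98},
                              {'LanguageCode': 'es', 'Score': 0.02}]}

print(dominant_language(FakeComprehend(), 'Speed and Service'))
```

With a real client, the returned code could be used to pick the matching dictionary before building the analyzer.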

In [11]:
obj = s3.get_object(Bucket=bucket, Key=key2)
obj = BytesIO(obj['Body'].read())

# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0.
df_example = pd.read_csv(obj, sep=",", on_bad_lines='skip', nrows=1000)
df_example.head(3)

| | Ticket # | Customer Complaint | Date | Time | Received Via | City | State | Zip code | Status | Filing on Behalf of Someone | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250635 | Comcast Cable Internet Speeds | 4/22/2015 | 3:53:50 PM | Internet | Abingdon | Maryland | 21009 | Closed | No | I have been contacting Comcast Internet Techni... |
| 1 | 223441 | Payment disappear - service got disconnected | 4/8/2015 | 10:22:56 AM | Internet | Acworth | Georgia | 30102 | Closed | No | Back in January 2015 I made 2 payments: One fo... |
| 2 | 242732 | Speed and Service | 4/18/2015 | 9:55:47 AM | Internet | Acworth | Georgia | 30101 | Closed | Yes | Our home is located at in Acworth Georgia 3010... |


In [12]:
start_time = dt.now()

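# Note: the two columns are joined without a separator, so the last word of the
# complaint title fuses with the first word of the description.
series_all_text = df_example["Customer Complaint"] + df_example["Description"]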
series_text_metrics_results = series_all_text.apply(tqa.generate_metrics_obj)
print(f'Analysis ended ({dt.now() - start_time})')
checkpoint = dt.now()

df_example['Total Words'] = series_text_metrics_results.apply(lambda x: x.total_words)
print(f'Total Words column created ({dt.now() - checkpoint})')
checkpoint = dt.now()

df_example['Wrong Words'] = series_text_metrics_results.apply(lambda x: x.wrong_words)
print(f'Wrong Words column created ({dt.now() - checkpoint})')
checkpoint = dt.now()

df_example['Quality Score'] = series_text_metrics_results.apply(lambda x: x.quality_score)
print(f'Quality Score column created ({dt.now() - checkpoint})')

print(f'\nTotal execution time: {dt.now() - start_time}')
print(f'Data processed: {len(series_all_text)} texts')

Analysis ended (0:04:20.676670)
Total Words column created (0:00:00.002620)
Wrong Words column created (0:00:00.001329)
Quality Score column created (0:00:00.001313)

Total execution time: 0:04:20.682360
Data processed: 1000 texts
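
Almost all of the run time above is spent in the analysis step. One likely contributor, which is an assumption and was not measured here, is that `idiom_words` is a list, so every `word in self.idiom_words` check scans the whole dictionary. A sketch of the same membership test against a `set` (the helper `build_vocabulary` is hypothetical):

```python
def build_vocabulary(*words_lists):
    """Merge any number of word lists into one lower-cased set."""
    vocabulary = set()
    for words in words_lists:
        vocabulary.update(word.strip().lower() for word in words)
    return vocabulary

vocabulary = build_vocabulary(['Earth', 'ether'], ['other'])
print('earth' in vocabulary, 'fsafdas' in vocabulary)
```

Set membership is a hash lookup on average, so its cost does not grow with the size of the dictionary the way a list scan does.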


In [13]:
df_example.head(3)

| | Ticket # | Customer Complaint | Date | Time | Received Via | City | State | Zip code | Status | Filing on Behalf of Someone | Description | Total Words | Wrong Words | Quality Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250635 | Comcast Cable Internet Speeds | 4/22/2015 | 3:53:50 PM | Internet | Abingdon | Maryland | 21009 | Closed | No | I have been contacting Comcast Internet Techni... | 275 | 20 | 13 |
| 1 | 223441 | Payment disappear - service got disconnected | 4/8/2015 | 10:22:56 AM | Internet | Acworth | Georgia | 30102 | Closed | No | Back in January 2015 I made 2 payments: One fo... | 231 | 16 | 13 |
| 2 | 242732 | Speed and Service | 4/18/2015 | 9:55:47 AM | Internet | Acworth | Georgia | 30101 | Closed | Yes | Our home is located at in Acworth Georgia 3010... | 491 | 43 | 11 |


## Results

### Top 10 Quality Scored Texts

In [14]:
selected_columns = ['Customer Complaint', 'Description', 'Total Words', 'Wrong Words', 'Quality Score']
df_example[selected_columns].sort_values('Quality Score', ascending=False).head(10)

| | Customer Complaint | Description | Total Words | Wrong Words | Quality Score |
|---|---|---|---|---|---|
| 859 | Hang-ups, Lies, Bill more than 2x higher & more | I have been a loyal customer of Comcast. I had... | 1094 | 19 | 54 |
| 92 | Unbelievable Treatment | I am getting overcharged with comcast / xfinit... | 690 | 12 | 53 |
| 4 | Comcast not working and no service to boot | I have been a customer of Comcast of some sort... | 652 | 13 | 46 |
| 689 | Comcast Service | Moved to a new address on 04/09/15 wherein I h... | 388 | 8 | 43 |
| 964 | Customer Service Nightmare | I tried to setup cable at a new apartment.\n A... | 214 | 4 | 42 |
| 928 | internet and service | I have been paying to get an internet speed of... | 1258 | 33 | 37 |
| 283 | Unauthorized billing | I have been a Comcast customer for many years,... | 541 | 14 | 36 |
| 209 | Monthly Charges Increased without any notice | I have signed a two years internet service pro... | 211 | 5 | 35 |
| 198 | Deceptive Practices | I inadvertently git behind on my Internet bill... | 161 | 4 | 32 |
| 715 | Comcast issues | i have been a customer for comcast for 15 mont... | 185 | 5 | 30 |
