# South Park Dataset Analysis

In [1]:
"""
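Opening the .csv file
"""
different_files = ['All-seasons-south-park.csv']
search_terms = []
with open(different_files[0], encoding='utf-8') as csv_file:
    csv_file.readline()  # skip the header row
    for line in csv_file:
        # Naive comma split: a quoted line of dialogue that contains commas
        # is broken into several tokens.
        line_contents = line.split(',')
        for word in line_contents:
            # Skip the lone closing quote that ends a multi-line quoted field.
            if word != '"\n':
                search_terms.append(word)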
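
Because the loop above splits on every comma, a spoken line that contains a comma ends up as several tokens. A minimal sketch of an alternative, assuming the file follows standard CSV quoting, is to let Python's `csv` module do the parsing. The helper name `read_rows`, the header names, and the two-record sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def read_rows(csv_file):
    """Hypothetical helper: yield each record as a list of fields."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    for row in reader:
        yield row

# Hypothetical sample modelled on the rows shown in this notebook;
# the header names are assumed.
sample = io.StringIO(
    'Season,Episode,Character,Line\n'
    '10,1,Stan,"You guys, you guys! Chef is going away. \n"\n'
    '10,1,Kyle,"Going away? For how long?\n"\n',
    newline='',
)
rows = list(read_rows(sample))
print(rows[0])
```

With a real file, the same helper could be called on `open(path, encoding='utf-8', newline='')`, so that quoted fields spanning several lines stay in one piece.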
    
    

## What is this dataset?

### This .csv file contains script information (season, episode, character, and line) from many South Park episodes.

### Hypotheses:
#### Because this file is smaller than the search terms CSV file, I expect the frequency analysis to run faster here than it did on the search terms file, which means it should finish in under 10 microseconds. Since 10 microseconds is already very fast, the two runtimes may turn out to be about the same.

## Importing data into a pandas dataframe

In [2]:
import pandas as pd
south_park_df = pd.DataFrame(search_terms, columns=['All Information'])
south_park_df.head()

|   | All Information |
|---|---|
| 0 | 10 |
| 1 | 1 |
| 2 | Stan |
| 3 | "You guys |
| 4 | you guys! Chef is going away. \n |


## Cleaning function and cleaning of pandas data

In [3]:
import re

def clean_token(token):
    """Remove apostrophes, newlines, and double quotes from a token."""
    token = re.sub("'", '', token)
    token = re.sub('\n', '', token)
    token = re.sub('"', '', token)
    return token
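
Since the three substitutions above only delete fixed characters, the same cleaning can be sketched without regular expressions. The variant below is an assumption-labelled alternative, not the notebook's method; the name `clean_token_translate` is hypothetical.

```python
# Translation table that deletes apostrophes, double quotes, and newlines.
_REMOVE = str.maketrans('', '', '\'"\n')

def clean_token_translate(token):
    """Hypothetical variant of clean_token built on str.translate."""
    return token.translate(_REMOVE)

print(clean_token_translate('"Going away? For how long?\n'))
```

Note that deleting apostrophes also turns a token such as `Kenny's` into `Kennys`, which is visible in the value counts later in the notebook.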

In [4]:
south_park_df['Cleaned Tokens'] = south_park_df['All Information'].apply(clean_token)
south_park_df.head(20)

|   | All Information | Cleaned Tokens |
|---|---|---|
| 0 | 10 | 10 |
| 1 | 1 | 1 |
| 2 | Stan | Stan |
| 3 | "You guys | You guys |
| 4 | you guys! Chef is going away. \n | you guys! Chef is going away. |
| 5 | 10 | 10 |
| 6 | 1 | 1 |
| 7 | Kyle | Kyle |
| 8 | "Going away? For how long?\n | Going away? For how long? |
| 9 | 10 | 10 |


## Frequency analysis methods and sorting the dictionary

In [5]:
def frequency_dict(search_terms):
    """
    Parameter: list of search terms
    Return: dictionary mapping each term to its frequency
    Return key: term
    Return value: number of times the term appears in the list
    """
    frequency_search_terms = {}
    for term in search_terms:
        # dict.get supplies 0 the first time a term is seen
        frequency_search_terms[term] = frequency_search_terms.get(term, 0) + 1

    return frequency_search_terms
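
For comparison, the standard library already ships a counting dictionary. The sketch below shows how the same tally could be produced with `collections.Counter`; the helper name `frequency_counter` and the sample tokens are hypothetical.

```python
from collections import Counter

def frequency_counter(tokens):
    """Hypothetical helper: count occurrences of each token with Counter."""
    return Counter(tokens)

# Small made-up sample standing in for the notebook's token list.
sample_tokens = ['Stan', 'Kyle', 'Stan', 'Cartman', 'Stan', 'Kyle']
counts = frequency_counter(sample_tokens)
print(counts.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be passed to any code that expects the dictionary returned by `frequency_dict`.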


In [6]:
def sorting_the_dict(frequency_dict):
    """
    Sorts a dictionary by value from high to low
    Param: dictionary with numeric values
    Return: new dictionary ordered by value from high to low
    """
    # sorted() returns (key, value) pairs; dict() keeps that order
    marklist = sorted(frequency_dict.items(), key=lambda x: x[1], reverse=True)
    sort_dict = dict(marklist)
    return sort_dict
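
When only the few most frequent entries are needed, sorting the whole dictionary is not strictly required. The following is a minimal sketch using `heapq.nlargest`; the helper name `top_n` and the sample counts are hypothetical.

```python
import heapq

def top_n(frequency_dict, n):
    """Hypothetical helper: the n most frequent (term, count) pairs."""
    return heapq.nlargest(n, frequency_dict.items(), key=lambda item: item[1])

# Made-up counts for illustration only.
sample_counts = {'Kyle': 2, 'Stan': 3, 'Cartman': 1}
print(top_n(sample_counts, 2))
```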


## Runtime analysis of the frequency_dict method and the value_counts method

In [7]:
%time
frequency_dictionary = frequency_dict(search_terms)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.53 µs


In [8]:
%time
sorted_dict = sorting_the_dict(frequency_dictionary)

CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 3.34 µs


In [9]:
%time
south_park_df['Cleaned Tokens'].value_counts()

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 2.86 µs


2                                                                                                                       11187
4                                                                                                                       10783
3                                                                                                                       10515
6                                                                                                                        9945
Cartman                                                                                                                  9912
                                                                                                                        ...  
Kennys parents must be laughing pretty hard about now! Were dumb enough to believe Kennys body could be in a teapot!        1
 that is- that is the most beautiful thing I have ever heard.                                                         
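
One caveat about the timing cells: in IPython, `%time` is a line magic that times the statement written on the same line, so when it sits on a line by itself it appears to time an empty statement rather than the code below it (`%%time` is the cell-level form). That would explain readings of a few microseconds regardless of data size. A minimal sketch of measuring a call directly with `time.perf_counter` follows; the helper `time_call`, the function `count_tokens`, and the synthetic token list are hypothetical stand-ins, not the notebook's data.

```python
import time

def time_call(func, *args, repeats=5):
    """Hypothetical helper: return (best wall time in seconds, last result)."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

def count_tokens(tokens):
    """Plain dictionary frequency count, mirroring frequency_dict."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

# Synthetic stand-in for the notebook's search_terms list.
tokens = ['Stan', 'Kyle', 'Cartman', 'Oh'] * 50_000
elapsed, counts = time_call(count_tokens, tokens)
print(f'best of 5 runs: {elapsed * 1e3:.2f} ms')
```

Taking the best of several runs reduces the influence of outside factors such as other processes on the machine.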

In [10]:
for key in sorted_dict:
    #print(f'Key: {key},       Value: {sorted_dict[key]}')
    pass

# Conclusions:
### The reported cell runtimes were consistently between about 3 and 5 microseconds, which is in line with what my hypothesis predicted. The difference between 5 microseconds and 10 microseconds is small enough to be attributed to many outside factors. The runtimes of the pandas value_counts method and my frequency counter method were also similar once again.
### The hypothesis lines up with the outcomes of the tests I ran.

### Both the method I wrote and the pandas value_counts method are very fast; each completes this analysis in almost no time.

# Some interesting facts that can be taken from the analysis:

### Who has the most lines:

#### 1. Cartman
#### 2. Stan
#### 3. Kyle
#### 4. Butters
#### 5. Randy

### What are the most common lines:

#### 1. 'Oh'
#### 2. 'Well'
#### 3. 'Yeah'
#### 4. 'No'
#### 5. 'Dude'
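
The rankings above come from a tally in which season numbers, episode numbers, character names, and dialogue fragments are all counted together. A sketch of counting lines per character only, assuming the header row names a `Character` column, could look like this. The helper `lines_per_character`, the header names, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def lines_per_character(csv_file, column='Character'):
    """Hypothetical helper: count how many rows each character has."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)

# Made-up sample rows; the header names are assumed.
sample = io.StringIO(
    'Season,Episode,Character,Line\n'
    '10,1,Stan,"You guys, you guys!\n"\n'
    '10,1,Kyle,"Going away?\n"\n'
    '10,1,Stan,"Forever.\n"\n',
    newline='',
)
character_counts = lines_per_character(sample)
print(character_counts.most_common(1))
```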

