# Data Analysis

Before you start, make sure that you are familiar with the basic usage of Jupyter Notebook. 

If not, please finish the Jupyter Notebook primer first. Within that primer you can find links to some starter notebooks hosted on Google Colab that will help you practice Linux, Bash, and Pandas fundamentals with **worked examples**.

In this task, you need to implement the following methods:
```
load_data_to_series
q6
q7
q8
q9
```

Please implement the `load_data_to_series` method first. You can check the output for each question by executing the cell below the question.

You may add more cells to the notebook. To keep a cell out of the converted script, tag it with `excluded_from_script`. You can display the tags for each cell in Jupyter Notebook: `View > Cell Toolbar > Tags`.

Execute `./runner.sh` in the console to check the result. Please make sure that the virtualenv is activated when `runner.sh` runs.

Finally, remember the write-up section on encoding awareness and apply those concepts when completing the required questions in this notebook.

# Pandas

Pandas is a Python library for practical and real-world data analysis. It provides fast, flexible, and expressive data structures to make it easy to work with data. 

Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional). This week you will start with the Series; you will practice with the DataFrame next week.
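To make the difference between the two structures concrete, here is a minimal sketch. The column names and values are made up for illustration and are not part of the Wikipedia dataset.

```python
import pandas as pd

# Hypothetical data, for illustration only.
frame = pd.DataFrame({"title": ["Page_A", "Page_B"], "views": [120, 45]})

# Selecting one column of a 2-dimensional DataFrame yields a 1-dimensional Series.
column = frame["views"]

print(type(frame).__name__, frame.shape)
print(type(column).__name__, column.shape)
```

The DataFrame has a shape with two entries (rows, columns), while the Series has only one (length).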

# Series

A Series is a one-dimensional labeled array that can hold any data type (integers, floating-point numbers, Python objects, etc.). In this task, you will load the filtered Wikipedia output into a Series, where the page title is the label and the view count is the data.
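As a small sketch of the label/data split described above, the following builds a Series by hand. The page titles and counts are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical page titles (labels) and view counts (data).
views = pd.Series([120, 45, 7], index=["Page_A", "Page_B", "Page_C"])

print(views.index.tolist())   # the labels
print(views.values.tolist())  # the data
print(views["Page_B"])        # look up a value by its label
```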

In [51]:
import pandas as pd
import sys
import argparse

In [52]:
def load_data_to_series(input_file):
    """
    Load the input file into a Series.
    Please read the documentation of the method `pandas.read_csv`:
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=index_col

    The default behavior of read_csv will infer the column names using the first line of the input file.
    In the provided Wikipedia dataset, the first line of the input file is not the column names; 
    hence, you need to change this default behavior.
    
    Hint: 
    1. How to read a TSV file using the read_csv method by specifying the delimiter
    2. How to not infer the first line as the column names
    3. How to read the data into a Series instead of a DataFrame
    4. How to specify the column to be used as the row labels
    
    :param input_file: the path to the input file
    :return: the Series
    """
    
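    # Read the tab-separated file without a header row; column 0 (page title)
    # becomes the index, leaving the view counts in the column labeled 1.
    data_frame = pd.read_csv(
        input_file,
        sep='\t',
        header=None,
        index_col=0
    )

    # Selecting a single column of a DataFrame returns a Series.
    return data_frame[1]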
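The hints in the docstring can be tried out without the real input file. The sketch below reads a tiny in-memory TSV; `sample_tsv` and its contents are hypothetical stand-ins for the filtered Wikipedia output.

```python
import io
import pandas as pd

# Hypothetical two-column TSV with no header row.
sample_tsv = "Page_A\t120\nPage_B\t45\nPage_C\t7\n"

frame = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",      # tab-delimited input
    header=None,   # the first line is data, not column names
    index_col=0,   # the first column becomes the row labels
)

# With header=None the columns are numbered 0, 1, ...; column 0 became
# the index, so the remaining column is labeled 1.
sample_series = frame[1]
print(sample_series)
```

Reading from `io.StringIO` is only a convenience for experimenting; `read_csv` accepts a file path in the same position.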

In [53]:
def q6():
    """
    Print a small sample of a Series as CSV
    
    To view the top n records, read the documentation:
    https://pandas.pydata.org/pandas-docs/stable/basics.html#head-and-tail
    
    output format:
    <page title>,<page view>
    <page title>,<page view>
    <page title>,<page view>
    ...
    
    """
    
    # read the output into series
    series = load_data_to_series("output")
    
    # TODO: replace "None" with your implementation to select the first 10 records
    res = series.head(10)
    
    # print the result to standard output in the CSV format
    res.to_csv(sys.stdout, header=False)

In [54]:
q6()

Martin_Shkreli,16605
Pamela_Anderson,12338
Damarious_Randall,11394
XHamster,11355
Tyrod_Taylor,9959
Gianni_Versace,7799
2018_Winter_Paralympics,7728
Black_Panther_(film),7005
A_Wrinkle_in_Time,5520
A_Wrinkle_in_Time_(2018_film),5316
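For experimenting with `head` and `to_csv` away from the real data, a minimal sketch follows. The Series contents are hypothetical, and the CSV is written to an in-memory buffer rather than standard output so the text can be inspected.

```python
import io
import pandas as pd

# Hypothetical page titles and view counts.
views = pd.Series([120, 45, 7, 3],
                  index=["Page_A", "Page_B", "Page_C", "Page_D"])

# Keep only the first two records, in their existing order.
top_two = views.head(2)

# Write "<label>,<value>" lines without a header row.
buffer = io.StringIO()
top_two.to_csv(buffer, header=False)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Note that `head` does not sort; it returns the first records in whatever order the Series already has.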


In [57]:
def q7():
    """
    Get values by index label
    
    Since the page title is the label of the Series, you can get the page view by page title.
    Please read the documentation:
    https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series-is-dict-like
    
    output format:
    <page view>
    """
    
    # read the output into a Series 
    series = load_data_to_series("output")
    
    # TODO: replace "None" with your implementation to select the row with the title "Cloud_computing"
    # raise NotImplementedError("To be implemented")
    res = series["Cloud_computing"]
    
    # print the result to standard output
    print(res)

In [58]:
q7()

340
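The dict-like behavior used in `q7` can be sketched on a small Series. The labels below are hypothetical; `Page_Z` is deliberately absent to show what happens with a missing label.

```python
import pandas as pd

# Hypothetical page titles and view counts.
views = pd.Series([120, 45], index=["Page_A", "Page_B"])

print(views["Page_A"])         # value stored under the label
print("Page_Z" in views)       # membership is tested against the labels
print(views.get("Page_Z", 0))  # get() returns a default instead of raising KeyError
```

If a label appears more than once in the index, the bracket lookup returns a Series of all matching values rather than a single number.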


In [60]:
def q8():
    """
    Generate descriptive statistics of a Series
    
    Please read the documentation of the Series and find the method to show all the descriptive statistics.
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html?highlight=descriptive
    
    output format:
    
    count,<number>
    mean,<number>
    std,<number>
    min,<number>
    25%,<number>
    50%,<number>
    75%,<number>
    max,<number>
    
    """
    
    # read the output into a Series
    series = load_data_to_series("output")
    
    # TODO: generate the descriptive statistics (replace "None" with your implementation)
    # raise NotImplementedError("To be implemented")
    res = series.describe()
    
    # print the result to standard output in the CSV format
    res.to_csv(sys.stdout, encoding='utf-8', header=False)

In [61]:
q8()

count,1563419.0
mean,5.802378632983225
std,39.056515205675964
min,1.0
25%,1.0
50%,2.0
75%,4.0
max,16605.0
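A small sketch of `describe` on hand-made numbers shows where the labels in the output come from. The values are hypothetical and chosen so the statistics are easy to check by hand.

```python
import pandas as pd

# Hypothetical view counts.
counts = pd.Series([1, 2, 3, 4])

# describe() returns a new Series whose labels are the statistic names.
stats = counts.describe()

print(stats.index.tolist())
print(stats["mean"])
```

For numeric data the result is itself a floating-point Series, which is why `count` is printed as `1563419.0` rather than as an integer in the output above.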


In [63]:
def q9():
    """
    Data filtering in Series
    
    Boolean indexing can be used to filter data in a Series.
    https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
    
    output format:
    <page title>,<page view>
    <page title>,<page view>
    <page title>,<page view>
    ...
    
    """
    
    # read the output into a Series
    series = load_data_to_series("output")
    
    # TODO: replace "None" with your implementation to
    #       select the rows with a view count greater than or equal to 2000 and less than 3000,
    #       i.e., [2000, 3000)
    # raise NotImplementedError("To be implemented")
    res = series[(series >= 2000) & (series < 3000)]
    
    # print the result to standard output in the CSV format
    res.to_csv(sys.stdout, encoding='utf-8', header=False)

In [64]:
q9()

Michael_Avenatti,2823
Bruno_Mars,2822
KDND,2775
Exo_(band),2767
Suicide_of_Jacintha_Saldanha,2753
Richard_Sherman_(American_football),2715
Kim_Jong-un,2670
Donatella_Versace,2654
The_Shape_of_Water_(film),2613
Lisa_Bonet,2526
Tommy_Lee,2516
"Three_Billboards_Outside_Ebbing,_Missouri",2407
Tonya_Harding,2382
Google,2354
List_of_Marvel_Cinematic_Universe_films,2339
Null,2331
Logic_(musician),2308
Joey_Lawrence,2300
The_Seekers,2294
Judith_Durham,2285
Sarah_Huckabee_Sanders,2274
California_Proposition_218_(1996),2229
Julian_Assange,2228
Daylight_saving_time,2196
The_Shape_of_Water,2189
Gal_Gadot,2164
XXX_(franchise),2109
Jo_Min-ki,2108
The_Righteous_Brothers,2051
Call_Me_by_Your_Name_(film),2009
Dwayne_Johnson,2009
The_Legend_of_Lylah_Clare,2003
Jason_Momoa,2001
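The half-open range filter can be checked on boundary values with a minimal sketch. The titles and counts are hypothetical and picked to sit exactly on and around the limits.

```python
import pandas as pd

# Hypothetical view counts on and around the range boundaries.
views = pd.Series([3000, 2500, 2000, 1999],
                  index=["Page_A", "Page_B", "Page_C", "Page_D"])

# Each comparison produces a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than >= and <.
mask = (views >= 2000) & (views < 3000)

# Indexing with the mask keeps only the rows where it is True.
in_range = views[mask]
print(in_range)
```

Here 2000 is kept and 3000 is dropped, matching the interval [2000, 3000). In recent pandas versions, `views.between(2000, 3000, inclusive="left")` should build the same mask, though that spelling depends on the installed version.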


# DO NOT MODIFY ANYTHING BELOW  

In [50]:
def main():
    parser = argparse.ArgumentParser(
    description="Data Analysis")
    parser.add_argument("-r",
                        metavar='<question_id>',
                        required=False)
    args = parser.parse_args()
    question = args.r

    if question is None:
        q6()
        q7()
        q8()
        q9()
    elif question == "q6":
        q6()
    elif question == "q7":
        q7()
    elif question == "q8":
        q8()
    elif question == "q9":
        q9()
    else:
        print("Invalid question")
        
if __name__ == "__main__":
    main()

