# Data Analysis

Before you start, make sure that you are familiar with the basic usage of Jupyter Notebook. 

If not, please finish the Jupyter Notebook primer first. Within that primer you can find links to some starter notebooks hosted on Google Colab that will help you practice Linux, Bash, and Pandas fundamentals with **worked examples**.

In this task, you need to implement the following methods:
```
load_data_to_series
q6
q7
q8
q9
```

Please implement the `load_data_to_series` method first. You can check the output for each question by executing the cell below the question.

You may add more cells to the notebook. To keep a cell out of the converted script, tag it with `excluded_from_script`. To display the tags for each cell in Jupyter Notebook, use `View > Cell Toolbar > Tags`.

Execute `./runner.sh` in the console to check the result. Please make sure that the virtualenv is activated when `runner.sh` runs.

Finally, review the write-up section on encoding awareness and apply those concepts when completing the required questions in this notebook.

# Pandas

Pandas is a Python library for practical and real-world data analysis. It provides fast, flexible, and expressive data structures to make it easy to work with data. 

Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional). This week you will start with the Series; next week you will practice with the DataFrame.

# Series

A Series is a one-dimensional labeled array that can hold any data type (integers, floating-point numbers, Python objects, etc.). In this task, you will load the filtered Wikipedia output into a Series, where the page title is the label and the view count is the data.
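To make the label/data split concrete, here is a minimal sketch that builds a small Series by hand. The page titles and view counts are made up for illustration and are not taken from the Wikipedia dataset.

```python
import pandas as pd

# Hypothetical page titles (labels) and view counts (data), for illustration only.
views = pd.Series(
    [2500, 120, 45],
    index=["Cloud_computing", "Pandas", "Series"],
    name="views",
)

print(views.index.tolist())  # the labels: page titles
print(views.tolist())        # the data: view counts
print(views.ndim)            # a Series has exactly one dimension
```

The Series loaded from the real dataset has the same shape; only the labels and counts differ.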

In [1]:
import pandas as pd
import sys
import argparse

In [None]:
def load_data_to_series(input_file):
    """
    Load the input file into a Series.
    Please read the documentation of the method `pandas.read_csv`:
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=index_col

    The default behavior of read_csv will infer the column names using the first line of the input file.
    In the provided Wikipedia dataset, the first line of the input file is not the column names; 
    hence, you need to change this default behavior.
    
    Hint: 
    1. How to read a TSV file using the read_csv method by specifying the delimiter
    2. How to not infer the first line as the column names
    3. How to read the data into a Series instead of a DataFrame
    4. How to specify the column to be used as the row labels
    
    :param input_file: the path to the input file
    :return: the Series
    """
    
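    # TODO: Load the input_file into a Series
    # sep='\t' reads the file as TSV, header=None treats the first line as data,
    # and index_col=0 uses the page title column as the row labels
    df = pd.read_csv(input_file, sep='\t', header=None, index_col=0)
    # the single remaining column holds the view counts; select it as a Series
    series = df.iloc[:, 0]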
    
    return series
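
The four hints in the docstring can be tried out on an in-memory sample before touching the real file. The sketch below is one possible approach, not the required solution. The sample rows and the column names `title` and `views` are hypothetical. The `quoting` and `keep_default_na` arguments are extra precautions, on the assumption that some page titles contain quote characters or look like missing-value markers.

```python
import csv
import io

import pandas as pd

# Hypothetical two-column TSV: <page title>\t<page view>, with no header row.
sample = 'Cloud_computing\t2500\n"Hello,_World!"_program\t120\nNaN\t45\n'

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                  # hint 1: tab-delimited input
    header=None,               # hint 2: the first line is data, not column names
    names=["title", "views"],  # hypothetical column names
    index_col="title",         # hint 4: page title becomes the row label
    quoting=csv.QUOTE_NONE,    # treat '"' as an ordinary character
    keep_default_na=False,     # keep a title such as "NaN" as text
)
series = df["views"]           # hint 3: one column of a DataFrame is a Series

print(series)
```

When reading from a path rather than a `StringIO` buffer, `read_csv` also accepts an explicit `encoding` argument, such as `encoding="utf-8"`.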

In [1]:
def q6():
    """
    Print a small sample of a Series as CSV
    
    To view the top n records, read the documentation:
    https://pandas.pydata.org/pandas-docs/stable/basics.html#head-and-tail
    
    output format:
    <page title>,<page view>
    <page title>,<page view>
    <page title>,<page view>
    ...
    
    """
    
    # read the output into series
    series = load_data_to_series("output")
    
    # select the first 10 records
    res = series.head(10)
    
    # print the result to standard output in the CSV format, without a header row
    res.to_csv(sys.stdout, header=False)
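
As a quick check of the output format, the following sketch writes the first records of a small, made-up Series to an in-memory buffer instead of standard output. The labels and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data, for illustration only.
views = pd.Series([10, 20, 30, 40], index=["A", "B", "C", "D"])

buffer = io.StringIO()
# head(2) keeps the first two records; header=False suppresses the header row.
views.head(2).to_csv(buffer, header=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

Each line has the form `<label>,<value>`, which matches the `<page title>,<page view>` format described in the docstring.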

In [23]:
q6()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 917507, saw 4
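
The cell that produced this error is not shown, so the cause can only be guessed. One plausible cause is that the file was parsed with the default comma delimiter: a tab-separated line then counts as a single field until a title containing commas appears. The sketch below reproduces an error of this kind on a made-up two-line sample and shows that passing `sep="\t"` avoids it.

```python
import io

import pandas as pd

# Hypothetical TSV content; the second title contains three commas.
sample = "Cloud_computing\t2500\nOne,_Two,_Three,_Four\t120\n"

try:
    # The default delimiter is ",", so line 1 has 1 field and line 2 has 4.
    pd.read_csv(io.StringIO(sample), header=None, index_col=0)
    error = None
except pd.errors.ParserError as exc:
    error = str(exc)

print(error)

# With the tab delimiter, both lines split into a title and a count.
ok = pd.read_csv(io.StringIO(sample), sep="\t", header=None, index_col=0)
print(ok.shape)
```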


In [2]:
def q7():
    """
    Get values by index label
    
    Since the page title is the label of the Series, you can get the page view by page title.
    Please read the documentation:
    https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series-is-dict-like
    
    output format:
    <page view>
    """
    
    # read the output into a Series
    series = load_data_to_series("output")
    # look up the page view by its page title, which is the Series label
    res = series["Cloud_computing"]
    print(res)

In [None]:
q7()
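
The dict-like behavior referred to in `q7` can be sketched on a small, made-up Series. The titles and counts below are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
views = pd.Series([2500, 120], index=["Cloud_computing", "Pandas"])

has_page = "Cloud_computing" in views     # membership is tested against the labels
count = views["Cloud_computing"]          # raises KeyError if the label is missing
fallback = views.get("No_such_page", 0)   # returns the default instead of raising

print(has_page, count, fallback)
```

Indexing with square brackets is the stricter choice when a missing title should be treated as an error, while `get` is useful when a default value is acceptable.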

In [3]:
def q8():
    """
    Generate descriptive statistics of a Series
    
    Please read the documentation of the Series and find the method to show all the descriptive statistics.
    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html?highlight=descriptive
    
    output format:
    
    count,<number>
    mean,<number>
    std,<number>
    min,<number>
    25%,<number>
    50%,<number>
    75%,<number>
    max,<number>
    
    """
    
    # read the output into a Series
    series = load_data_to_series("output")
    # describe() returns count, mean, std, min, quartiles and max as a Series
    res = series.describe()
    res.to_csv(sys.stdout, encoding='utf-8', header=False)

In [None]:
q8()
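
For reference, here is a minimal sketch of what `describe` returns for a numeric Series, using made-up values. The result is itself a Series whose labels are the statistic names listed in the output format of `q8`.

```python
import pandas as pd

# Hypothetical values, for illustration only.
views = pd.Series([1, 2, 3, 4, 5])

stats = views.describe()

print(stats.index.tolist())  # names of the statistics
print(stats["mean"])         # individual statistics are looked up by label
print(stats["50%"])          # the median
```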

In [4]:
def q9():
    """
    Data filtering in Series
    
    Boolean indexing can be used to filter data in a Series.
    https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
    
    output format:
    <page title>,<page view>
    <page title>,<page view>
    <page title>,<page view>
    ...
    
    """
    
    # read the output into a Series
    series = load_data_to_series("output")
    # keep the pages whose view count is at least 2000 and below 3000
    res = series[(series >= 2000) & (series < 3000)]
    res.to_csv(sys.stdout, encoding='utf-8', header=False)

In [None]:
q9()
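
Boolean indexing works in two steps: a comparison produces a boolean Series with the same labels, and indexing with that mask keeps only the entries marked `True`. The sketch below uses made-up labels and counts to show the boundary behavior of a half-open range.

```python
import pandas as pd

# Hypothetical data, for illustration only.
views = pd.Series([1500, 2000, 2999, 3000], index=["A", "B", "C", "D"])

# Each comparison needs its own parentheses because & binds tighter than >= and <.
mask = (views >= 2000) & (views < 3000)
selected = views[mask]

print(mask.tolist())
print(selected)
```

The lower bound is included and the upper bound is excluded, so `B` and `C` are kept while `A` and `D` are dropped.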

# DO NOT MODIFY ANYTHING BELOW  

In [None]:
def main():
    parser = argparse.ArgumentParser(
    description="Data Analysis")
    parser.add_argument("-r",
                        metavar='<question_id>',
                        required=False)
    args = parser.parse_args()
    question = args.r

    if question is None:
        q6()
        q7()
        q8()
        q9()
    elif question == "q6":
        q6()
    elif question == "q7":
        q7()
    elif question == "q8":
        q8()
    elif question == "q9":
        q9()
    else:
        print("Invalid question")
        
if __name__ == "__main__":
    main()