# <center>Assignment 2</center>

## Q1. Define a function to analyze a numpy array (5 points)
 - Assume we have an array (with shape (M,N)) which contains term frequency of each document, where each row is a document, each column is a word, and the corresponding value denotes the frequency of the word in the document. Define a function named "analyze_tf_idf" which: (overall 1 point)
      * takes the **array**, and an integer **K** as the parameters.
      * normalizes the frequency of each word as: word frequency divided by the length of the document. Save the result as an array named **tf** (i.e. term frequency) (1 point)
      * calculates the document frequency (**df**) of each word, i.e. how many documents contain a specific word (1 point)
      * calculates the **tf_idf** array as: **tf / (log(df)+1)** (tf divided by log(df) plus 1). The reason is that a word which appears in most documents has little discriminative power and is often called a "stop" word. Dividing by a function of df downweights such words. (1 point)
      * for each document, finds out the **indexes of words with top K largest values in the tf_idf array**, ($0<K<=N$). These indexes form an array, say **top_K**, with shape (M, K) (1 point)
      * returns the tf_idf array, and the top_K array. 
 - Note, for all the steps, **do not use any loop**. Just use array functions and broadcasting for high performance computation.
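
The loop-free steps above can be sketched on a small array. This is only an illustration under stated assumptions, not the reference solution: the array `toy`, the value of `K`, and the variable names are made up for the example, and it assumes every word occurs in at least one document so that log(df) is finite.

```python
import numpy as np

# Hypothetical 2 x 3 term-count array: 2 documents, 3 words.
toy = np.array([[2, 0, 2],
                [1, 3, 0]])
K = 2

# tf: divide each row by its document length. keepdims=True keeps the
# row sums as shape (2, 1) so they broadcast against the (2, 3) counts.
tf = toy / toy.sum(axis=1, keepdims=True)

# df: number of documents in which each word occurs, shape (3,).
df = (toy > 0).sum(axis=0)

# tf_idf: the (3,) denominator broadcasts across every row of tf.
tf_idf = tf / (np.log(df) + 1)

# top_K: sort each row by descending score, keep the first K column indexes.
top_K = np.argsort(-tf_idf, axis=1)[:, :K]

print(top_K)
```

Negating the scores before `np.argsort` is one way to get descending order; reversing an ascending sort works equally well.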

## Q2. Define a function to analyze stackoverflow dataset using pandas (5 points)
 - Define a function named "analyze_data" to do the following:
   * Take a csv file path string as an input. Assume the csv file is in the format of the provided sample file (question.csv).
   * Read the csv file as a dataframe with the first row as column names
   * Find the questions with the top 3 viewcounts among answered questions (i.e. answercount>0). Print the title and viewcount columns of these questions. (1 point)
   * Find the top 5 users (i.e. quest_name) who asked the most questions. (1 point)
   * Create a new column called "first_tag" to store the very first tag in the "tags" column (hint: use the "apply" function; tags are separated by ", ") (1 point)
   * Show the mean, min, and max viewcount values for each of these tags: "python", "pandas" and "dataframe" (1 point)
   * Create a cross tab with answercount as row indexes, first_tag as column names, and the count of samples as the value. For "python" questions (i.e. first_tag="python"), how many questions were not answered (i.e. answercount=0), how many were answered once (i.e. answercount=1), and how many were answered twice (i.e. answercount=2)? Print these numbers. (1 point)
 - This function does not have any return. Just print out the result of each calculation step.
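
The first-tag and cross-tab steps can be tried on a tiny frame before running them on the real file. The sketch below assumes a made-up `sample` DataFrame standing in for question.csv, with only the two columns these steps need; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for question.csv (only the columns used here).
sample = pd.DataFrame({
    "answercount": [0, 1, 1, 2, 0, 0],
    "tags": ["python, pandas", "python", "pandas, dataframe",
             "python, numpy", "dataframe", "python, list"],
})

# First tag of each question; tags are separated by ", ".
sample["first_tag"] = sample["tags"].apply(lambda t: t.split(", ")[0])

# Rows are answercount values, columns are first tags, cells are counts.
table = pd.crosstab(index=sample["answercount"], columns=sample["first_tag"])

# Counts of "python" questions answered 0, 1, and 2 times.
python_counts = table.loc[[0, 1, 2], "python"]
print(python_counts)
```

Selecting the labels `[0, 1, 2]` explicitly raises a `KeyError` if one of those answer counts never occurs, which makes a missing category visible instead of silently skipping it.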

## Q3 (Bonus). Analyze a collection of documents (3 points)
 - Define a function named "analyze_corpus" to do the following:
   * Similar to Q2, take a csv file path string as an input. Assume the csv file is in the format of the provided sample file (question.csv).
   * Read the "title" column from the csv file and convert it to lower case
   * Split each string in the "title" column by space to get tokens. Create an array where each row represents a title, each column denotes a unique token, and each value denotes the count of the token in the document (2 points)
   * Call your function in Q1 (i.e. analyze_tf_idf) to analyze this array
   * Print out the top 5 words by tf-idf score for the first 20 questions. Do you think these top words allow you to find similar questions or differentiate a question from dissimilar ones? Write your analysis as a pdf file. (1 point)
   
- This function does not have any return. Just print out the result if asked.
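
One possible way to build the title-by-token count array is sketched below. It assumes three invented titles in place of the "title" column, and the names `titles`, `vocab`, `col`, and `counts` are hypothetical. Sorting the vocabulary is a design choice that keeps the column order reproducible.

```python
from collections import Counter

import numpy as np

# Hypothetical titles standing in for the "title" column.
titles = ["Pandas merge two dataframes",
          "merge two lists in Python",
          "Python list to numpy array"]

# Lowercase each title and split on whitespace to get its tokens.
tokenized = [t.lower().split() for t in titles]

# A sorted vocabulary gives every unique token a fixed column index.
vocab = sorted({tok for toks in tokenized for tok in toks})
col = {tok: j for j, tok in enumerate(vocab)}

# counts[i, j] is the number of times vocab[j] occurs in title i.
counts = np.zeros((len(titles), len(vocab)), dtype=int)
for i, toks in enumerate(tokenized):
    for tok, n in Counter(toks).items():
        counts[i, col[tok]] = n

print(counts.shape)
```

The resulting `counts` array has one row per title and one column per unique token, which is the input shape that analyze_tf_idf expects.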
   

## Submission Guideline
- Follow the solution template provided below. Use the `__main__` block to test your functions
- Save your code into a python file (e.g. assign2.py) that can be run in a python 3 environment. In Jupyter Notebook, you can export notebook as .py file in menu "File->Download as".
- Make sure you have all import statements. To test your code, open a command window in your current python working folder, type "python assign2.py" to see if it can run successfully.
- **Each homework assignment should be completed independently. Never ever copy others' work**

In [2]:
# Structure of your solution to Assignment 2

import numpy as np
import pandas as pd



def analyze_data(filepath):
    
    data = pd.read_csv(filepath)

    print(data[data["answercount"]>=1].sort_values(by="viewcount", ascending=False).iloc[0:3])
    print(data["quest_name"].value_counts().iloc[0:5])
    data["first_tag"]=data["tags"].apply(lambda x: x.split(", ")[0])
    print(data[data["first_tag"].isin(["python","pandas","dataframe"])].groupby("first_tag")["viewcount"].agg([np.min, np.max, np.mean]))
    # label-based slice is inclusive: rows for answercount 0, 1 and 2
    print(pd.crosstab(index=data.answercount, columns=data.first_tag)["python"].loc[0:2])
    
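def analyze_tf_idf(arr,K):
    
    # get tf: divide each row by the document length (row sum)
    tf=arr/np.sum(arr, axis=1, keepdims=True)

    # get df: number of documents that contain each word
    df=np.sum(np.where(arr>0, 1, 0), axis=0)

    # get tf_idf
    tf_idf=tf/(np.log(df)+1)

    # get indexes of the top K words of each document
    top_k=np.argsort(tf_idf)[:,::-1][:,0:K]
    
    return tf_idf, top_k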


def analyze_corpus(filepath):
    
    data = pd.read_csv(filepath)
    
    doc_tokens={}
    for idx, doc in enumerate(data["title"].values):
        tokens= doc.lower().split()
        token_count={w:tokens.count(w) for w in set(tokens)}
        doc_tokens[idx]=token_count
    
    doc_df=pd.DataFrame.from_dict(doc_tokens, orient="index")
    doc_df= doc_df.fillna(0)
    
    tf_idf, top_k=analyze_tf_idf(doc_df.values, 5)
    
    for i in range(20):
        print(data.iloc[i]["title"], [doc_df.columns[j] for j in top_k[i]])
   


# best practice to test your class
# if your script is exported as a module,
# the following part is ignored
# this is equivalent to main() in Java

if __name__ == "__main__":  
    
    # Test Question 1
    arr=np.array([[0,1,0,2,0,1],[1,0,1,1,2,0],[0,0,2,0,0,1]])
    
    print("\nQ1")
    tf_idf, top_k=analyze_tf_idf(arr,3)
    print(top_k)
    
    print("\nQ2")
    # analyze_data prints its own results and returns nothing
    analyze_data('../../dataset/question.csv')
    
    # test question 3
    print("\nQ3")
    analyze_corpus('../../dataset/question.csv')


Q1
[[3 1 5]
 [4 0 3]
 [2 5 4]]

Q2
           id         creationdate  score  viewcount  \
75   48066517  2018-01-02 19:12:03     24      33297   
163  48094854  2018-01-04 12:01:18      3      16658   
886  48350850  2018-01-19 23:12:38      3      11176   

                                                 title  answercount  \
75   Python: Pandas pd.read_excel giving ImportErro...            7   
163                     Python convert object to float            2   
886                  Subtract two columns in dataframe            5   

                                  tags        quest_name  
75   python, excel, python-2.7, pandas       Vineeth Sai  
163                     python, pandas  Almog Woldenberg  
886                     python, pandas             Peter  
Shuvayan Das    7
Rahul rajan     7
el323           6
Danny W         6
Hana            5
Name: quest_name, dtype: int64
           amin   amax        mean
first_tag                         
pandas       14   4499  454