Step 1. Ensure that you have the dataset file named `transactions.csv` in the current directory.

The dataset is a subset of https://www.kaggle.com/ealaxi/paysim1/version/2 which was originally generated as part of the following research:

E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016

Step 2. Complete the following exercises.

0. Read the dataset (`transactions.csv`) as a Pandas dataframe. Note that the first row of the CSV contains the column names.

0. Return the column names as a list from the dataframe.

0. Return the first k rows from the dataframe.

0. Return a random sample of k rows from the dataframe.

0. Return a list of the unique transaction types.

0. Return a Pandas series of the top 10 transaction destinations with frequencies.

0. Return all the rows from the dataframe for which fraud was detected.

0. Bonus. Return a dataframe that contains the number of distinct destinations that each source has interacted with, sorted in descending order. You will find [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) useful. The predefined aggregate functions are under `pandas.core.groupby.GroupBy.*`. See the [left-hand column](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.nunique.html).
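
As a hedged illustration of the `groupby`/`agg` hint, the sketch below counts distinct destinations per source on a small made-up frame. The `toy` rows and the `n_destinations` column name are assumptions for illustration; only the `nameOrig` and `nameDest` column names come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; not taken from transactions.csv.
toy = pd.DataFrame({
    'nameOrig': ['C1', 'C1', 'C2', 'C1', 'C3'],
    'nameDest': ['M1', 'M2', 'M1', 'M1', 'M4'],
})

# Named aggregation: one output column holding the distinct-destination count.
distinct_dest = (
    toy.groupby('nameOrig')
       .agg(n_destinations=('nameDest', 'nunique'))
       .sort_values('n_destinations', ascending=False)
)
print(distinct_dest)
```

Because `agg` is called on the grouped dataframe, the result is a dataframe indexed by source rather than a series.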

Use the empty cell to test the exercises. If you modify the original `df`, you can rerun the cell containing `exercise_0`.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd


def exercise_0(file):
    return pd.read_csv(file)

def exercise_1(df):
    # The exercise asks for a list, not a pandas Index.
    return df.columns.tolist()

def exercise_2(df, k):
    return df.head(k)

def exercise_3(df, k):
    return df.sample(k)

def exercise_4(df):
    # Unique transaction types as a plain list (not their frequencies).
    return df['type'].unique().tolist()

def exercise_5(df):
    # value_counts sorts in descending order, so the head is the top 10.
    return df['nameDest'].value_counts().head(10)

def exercise_6(df):
    return df.loc[df['isFraud'] == 1]

def exercise_7(df):
    # agg on the grouped dataframe returns a dataframe, as the exercise asks.
    df = df.groupby('nameOrig').agg({'nameDest': 'nunique'})
    return df.sort_values('nameDest', ascending=False)

def visual_1(df):
    pass

def visual_2(df):
    pass

def exercise_custom(df):
    pass
    
def visual_custom(df):
    pass

In [None]:
df = exercise_0('/home/data/transactions.csv')

In [None]:
a = pd.DataFrame({'origin': ['a', 'b', 'c', 'a'], 'dest': ['d', 'e', 'f', 'd']})
a.groupby('origin').apply(lambda x: x.dest.nunique())

In [None]:
# Test exercises here
ex1 = exercise_1(df)
ex1

In [None]:
ex2 = exercise_2(df, 3)
ex2

In [None]:
ex3 = exercise_3(df, 3)
ex3

In [None]:
ex4 = exercise_4(df)
ex4

In [None]:
ex5 = exercise_5(df)
ex5

In [None]:
ex6 = exercise_6(df)
ex6

In [None]:
ex7 = exercise_7(df)
ex7

Create graphs for the following. 
1. Transaction types bar chart, Transaction types split by fraud bar chart
1. Origin account balance delta v. Destination account balance delta scatter plot for Cash Out transactions

Ensure that the graphs have the following:
 - Title
 - Labeled Axes
 
Each function should plot the graph and then return a string containing a short description explaining the relevance of the chart.
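
A minimal sketch of that pattern, assuming a made-up frame with only a `type` column. The name `visual_sketch` and the `toy` data are hypothetical and not part of the exercises; the point is only that the function sets a title, labels both axes, and returns the description string.

```python
import matplotlib.pyplot as plt
import pandas as pd


def visual_sketch(df):
    # Hypothetical helper: bar chart of the number of rows per transaction type.
    counts = df['type'].value_counts()
    fig, ax = plt.subplots()
    counts.plot(kind='bar', ax=ax)
    ax.set_title('Transactions per type')
    ax.set_xlabel('Type of transaction')
    ax.set_ylabel('Number of transactions')
    return 'Shows which transaction types dominate the sample'


# Made-up rows for illustration; not taken from transactions.csv.
toy = pd.DataFrame({'type': ['PAYMENT', 'CASH_OUT', 'PAYMENT']})
description = visual_sketch(toy)
```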

In [None]:
def visual_1(df):
    def transaction_counts(df):
        return df.value_counts('type')

    def transaction_counts_split_by_fraud(df):
        return df.loc[df['isFraud'] == 1].value_counts('type')

    fig, axs = plt.subplots(2, figsize=(6,10))
    transaction_counts(df).plot(ax=axs[0], kind='bar')
    axs[0].set_title('Transaction types bar chart')
    axs[0].set_xlabel('Type of transaction')
    axs[0].set_ylabel('Number of transactions')
    transaction_counts_split_by_fraud(df).plot(ax=axs[1], kind='bar')
    axs[1].set_title('Transaction types split by fraud bar chart')
    axs[1].set_xlabel('Type of transaction')
    axs[1].set_ylabel('Number of transactions')
    fig.suptitle('Number of transactions and fraudulent transactions by type')
    fig.tight_layout(rect=[0, 0.03, 1, 0.95])
    for ax in axs:
        for p in ax.patches:
            ax.annotate(p.get_height(), (p.get_x(), p.get_height()))
    return 'This plot allows us to draw inferences about trends in fraudulent transaction types'

visual_1(df)


In [None]:
def visual_2(df):
    def query(df):
        # Vectorized balance deltas, restricted to Cash Out transactions.
        cash_out = df.loc[df['type'] == 'CASH_OUT']
        return pd.DataFrame({
            'orig_delta': cash_out['oldbalanceOrg'] - cash_out['newbalanceOrig'],
            'dest_delta': cash_out['newbalanceDest'] - cash_out['oldbalanceDest'],
        })
    plot = query(df).plot.scatter(x='orig_delta', y='dest_delta')
    plot.set_title('Origin v. destination account balance delta in Cash Out transactions')
    plot.set_xlabel('Origin account balance delta')
    plot.set_ylabel('Destination account balance delta')
    plot.set_xlim(left=-1e3, right=1e3)
    plot.set_ylim(bottom=-1e3, top=1e3)
    return 'A plot that shows how the origin and destination account balance deltas in Cash Out transactions relate to each other, making it easier to spot patterns and outliers'
# Origin account balance delta v. Destination account balance delta scatter plot for Cash Out transactions
visual_2(df)


Use your newly-gained Pandas skills to find an insight from the dataset. You have full flexibility to go in whichever direction interests you. Please create a visual as above for this query. `visual_custom` should call `exercise_custom`.
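
One way to structure this, sketched under assumptions: the query function returns the data and the visual function calls it and only handles presentation. The names `exercise_custom_sketch` and `visual_custom_sketch`, the fraud-rate query, and the `toy` rows are hypothetical; only the `type` and `isFraud` column names come from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def exercise_custom_sketch(df):
    # Hypothetical query: share of transactions flagged as fraud, per type.
    return df.groupby('type')['isFraud'].mean().sort_values(ascending=False)


def visual_custom_sketch(df):
    # The visual calls the query, then only deals with presentation.
    fig, ax = plt.subplots()
    exercise_custom_sketch(df).plot(kind='bar', ax=ax)
    ax.set_title('Fraud rate by transaction type')
    ax.set_xlabel('Type of transaction')
    ax.set_ylabel('Share of transactions flagged as fraud')
    return 'Compares how often each transaction type is flagged as fraud'


# Made-up rows for illustration; not taken from transactions.csv.
toy = pd.DataFrame({
    'type': ['TRANSFER', 'TRANSFER', 'PAYMENT', 'CASH_OUT'],
    'isFraud': [1, 0, 0, 1],
})
rates = exercise_custom_sketch(toy)
description = visual_custom_sketch(toy)
```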

In [None]:
def exercise_custom(df):
    fraud = df.loc[df['isFraud'] == 1]
    # Filter on the fraud subset itself, not on a mask built from the full df.
    amount_transaction = fraud.loc[fraud['type'] == 'TRANSFER', 'amount'].tolist()
    amount_cash_out = fraud.loc[fraud['type'] == 'CASH_OUT', 'amount'].tolist()
    # Make them the same length: pad the shorter list with its own mean.
    shorter, longer = sorted([amount_transaction, amount_cash_out], key=len)
    i = len(longer) - len(shorter)
    if i:
        shorter.extend([sum(shorter) / len(shorter)] * i)

    print(f'applied mean to {i} elements')

    df = pd.DataFrame({'amount_transaction': amount_transaction, 'amount_cash_out': amount_cash_out})
    return df

def visual_custom(df):
    plot = exercise_custom(df).plot.scatter(x='amount_transaction',y='amount_cash_out')
    plot.set_title('Distribution of transaction amounts in different types of fraudulent transactions')
    plot.set_xlabel('Fraudulent TRANSFER amount')
    plot.set_ylabel('Fraudulent CASH_OUT amount')
    return 'Distribution of transaction amounts in different types of fraudulent transactions'
pd.options.display.float_format = '{:.2f}'.format
    


In [None]:
visual_custom(df)

In [None]:
def bonus(df):
    df = df.loc[df['isFraud'] == 1].value_counts('amount')
    plot = df.plot(kind='hist')
    plot.set_title('Fraudulent transactions with same amounts histogram')
    plot.set_xlabel('Number of transactions with same amount')
    plot.set_ylabel('Number of distinct amounts')
    return 'A way to spot trends in fraudulent transactions with the same amount'

In [None]:
bonus(df)

Submission

1. Copy the exercises into `task1.py`.
2. Upload `task1.py` to Forage.

All done!

Your work will be instrumental for our team's continued success.