##### DataFrames 

##### Introduction to DataFrames in Pandas

>> Components of a DataFrame: there are three components
- **.values** - gives access to the underlying data of a DataFrame or Series as a NumPy array.

    For a DataFrame named data, data.values returns a two-dimensional NumPy array representing the values in the DataFrame. Each row of the array corresponds to a row in the DataFrame, and each column corresponds to a column in the DataFrame.

    Similarly, for a Series named data, data.values returns a one-dimensional NumPy array representing the values in the Series.

    Accessing the values directly as a NumPy array can be useful when you want to perform numerical operations or use NumPy's functions on the data without considering the index or column labels of the DataFrame or Series.

- **.columns** - an attribute that returns the column labels of a DataFrame. For a DataFrame named data, data.columns returns an Index object containing the column labels.

    This attribute is particularly useful when you want to access, iterate over, or manipulate the column labels programmatically. You can access individual columns by indexing data with the column label or use methods like .loc[] or .iloc[] to access specific rows and columns based on their labels or positions.

- **.index** - an attribute that provides access to the index labels of a DataFrame or Series. For a DataFrame named data, data.index returns an Index object containing the index labels.

    The index labels represent the labels assigned to each row of the DataFrame. They can be integer-based, datetime-based, or even string-based, depending on how the DataFrame was created or manipulated.

    You can use data.index to access, iterate over, or manipulate the index labels programmatically. For example, you can access specific rows using .loc[] or .iloc[] methods based on their index labels or positions.
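
The three components described above can be seen side by side on a small example. This is a minimal sketch: the `toy` DataFrame, its values, and its string row labels are made up for illustration and are not part of the fraud dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with string row labels, not taken from the fraud dataset
toy = pd.DataFrame(
    {"amt": [2.86, 29.84, 41.28], "is_fraud": [0, 0, 1]},
    index=["a", "b", "c"],
)

# .values: the raw data as a NumPy array, without labels
frame_values = toy.values            # 2-D array of shape (3, 2)
series_values = toy["amt"].values    # 1-D array of shape (3,)
total_amt = np.sum(series_values)    # NumPy functions work directly on the array

# .columns: the column labels as an Index object
column_labels = list(toy.columns)

# .index: the row labels as an Index object
row_labels = list(toy.index)

# Label-based and position-based access reach the same cell
by_label = toy.loc["b", "amt"]
by_position = toy.iloc[1, 0]

print(frame_values.shape, series_values.shape, total_amt)
print(column_labels, row_labels, by_label, by_position)
```

Note that the pandas documentation recommends `.to_numpy()` over `.values` for new code; `.values` is used here because it is the attribute this notebook discusses.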

#### Dataset Source
https://www.kaggle.com/datasets/kartik2112/fraud-detection

In [17]:
import pandas as pd

##### Loading the dataset and trying to do some exploration

In [18]:
data = pd.read_csv('/home/nacre/Python/Pandas/datasets/fraudTest.csv')

# Print the head of the dataset to preview the data
# Expected output is the first 3 rows, from index 0 to index 2

data.head(3)

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2020-06-21 12:14:25,2291163933867244,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,1968-03-19,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,2020-06-21 12:14:33,3573030041201292,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",1990-01-17,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,2020-06-21 12:14:53,3598215285024754,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",1970-10-21,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0


#### Inspecting the DataFrame further

In [19]:
# Inspecting the last three rows of the dataset
data.tail(3)

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
555716,555716,2020-12-31 23:59:15,6011724471098086,fraud_Rau-Robel,kids_pets,86.88,Ann,Lawson,F,144 Evans Islands Apt. 683,...,46.1966,-118.9017,3684,Musician,1981-11-29,6c5b7c8add471975aa0fec023b2e8408,1388534355,46.65834,-119.715054,0
555717,555717,2020-12-31 23:59:24,4079773899158,fraud_Breitenberg LLC,travel,7.99,Eric,Preston,M,7020 Doyle Stream Apt. 951,...,44.6255,-116.4493,129,Cartographer,1965-12-15,14392d723bb7737606b2700ac791b7aa,1388534364,44.470525,-117.080888,0
555718,555718,2020-12-31 23:59:34,4170689372027579,fraud_Dare-Marvin,entertainment,38.13,Samuel,Frey,M,830 Myers Plaza Apt. 384,...,35.6665,-97.4798,116001,Media buyer,1993-05-10,1765bb45b3aa3224b4cdcb6e7a96cee3,1388534374,36.210097,-97.036372,0


In [20]:
# Check the datatypes and whether we have any missing values
# Expected output is that there are no missing values, as the dataset was randomly generated
print("Check for Missing values and the Datatypes \n \n")
data.info()

Check for Missing values and the Datatypes 
 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  fl
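
When the `info()` listing is long, a per-column count of missing values can be easier to scan. The sketch below assumes a small made-up `toy` DataFrame with two deliberately missing entries; the same two calls would apply to `data`.

```python
import pandas as pd

# Hypothetical toy data containing two missing entries
toy = pd.DataFrame(
    {
        "amt": [2.86, None, 41.28],
        "category": ["travel", "kids_pets", None],
    }
)

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```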

In [21]:
print("The dataset has the following number of rows and columns respectively")
data.shape

The dataset has the following number of rows and columns respectively


(555719, 23)

In [22]:
print("Below are the summary statistics\n \n",  data.describe().T)

Below are the summary statistics
 
                count          mean           std           min           25%  \
Unnamed: 0  555719.0  2.778590e+05  1.604224e+05  0.000000e+00  1.389295e+05   
cc_num      555719.0  4.178387e+17  1.309837e+18  6.041621e+10  1.800429e+14   
amt         555719.0  6.939281e+01  1.567459e+02  1.000000e+00  9.630000e+00   
zip         555719.0  4.884263e+04  2.685528e+04  1.257000e+03  2.629200e+04   
lat         555719.0  3.854325e+01  5.061336e+00  2.002710e+01  3.466890e+01   
long        555719.0 -9.023133e+01  1.372178e+01 -1.656723e+02 -9.679800e+01   
city_pop    555719.0  8.822189e+04  3.003909e+05  2.300000e+01  7.410000e+02   
unix_time   555719.0  1.380679e+09  5.201104e+06  1.371817e+09  1.376029e+09   
merch_lat   555719.0  3.854280e+01  5.095829e+00  1.902742e+01  3.475530e+01   
merch_long  555719.0 -9.023138e+01  1.373307e+01 -1.666716e+02 -9.690513e+01   
is_fraud    555719.0  3.859864e-03  6.200784e-02  0.000000e+00  0.000000e+00   

    

In [23]:
#### Exploring the different parts of the DataFrame

In [24]:
# Values attribute of the dataFrame
data.values

array([[0, '2020-06-21 12:14:25', 2291163933867244, ..., 33.986391,
        -81.200714, 0],
       [1, '2020-06-21 12:14:33', 3573030041201292, ..., 39.450498,
        -109.960431, 0],
       [2, '2020-06-21 12:14:53', 3598215285024754, ..., 40.49581,
        -74.196111, 0],
       ...,
       [555716, '2020-12-31 23:59:15', 6011724471098086, ..., 46.65834,
        -119.715054, 0],
       [555717, '2020-12-31 23:59:24', 4079773899158, ..., 44.470525,
        -117.080888, 0],
       [555718, '2020-12-31 23:59:34', 4170689372027579, ..., 36.210097,
        -97.036372, 0]], dtype=object)
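
The `dtype=object` in the output above appears because the DataFrame mixes text and numeric columns, so NumPy falls back to a generic object array. A minimal sketch, assuming a made-up `toy` DataFrame, of how restricting the frame to numeric columns gives a numeric array instead:

```python
import pandas as pd

# Hypothetical toy data mixing a text column with numeric columns
toy = pd.DataFrame(
    {
        "merchant": ["fraud_Rau-Robel", "fraud_Dare-Marvin"],
        "amt": [86.88, 38.13],
        "is_fraud": [0, 0],
    }
)

# Mixed column types force a generic object array
mixed = toy.values

# Keeping only numeric columns yields a numeric array
numeric = toy.select_dtypes(include="number").to_numpy()

print(mixed.dtype, numeric.dtype, numeric.shape)
```

A numeric array is usually what NumPy functions expect, so selecting the relevant columns first tends to be the safer route.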

In [25]:
# Columns check
# Expected output is the column headers that we have in the dataset
data.columns

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

- data.columns provides a quick way of checking at a glance which columns the dataset contains

In [26]:
# Expected output is the RangeIndex describing the row labels (start, stop, step)
print("The row index of the dataset is: \n \n", data.index)

The row index of the dataset is: 
 
 RangeIndex(start=0, stop=555719, step=1)
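
The index describes the row labels rather than giving a count directly. If the number of rows is what you need, it can be derived from the index or from the shape. A minimal sketch on a made-up `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data with the default integer index
toy = pd.DataFrame({"amt": [2.86, 29.84, 41.28]})

row_index = toy.index                 # a RangeIndex for default row labels
n_rows = len(toy.index)               # number of rows via the index
n_rows_from_shape = toy.shape[0]      # same number via the shape tuple

print(row_index, n_rows, n_rows_from_shape)
```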


#### Sorting and Subsetting

- **Sorting**

Sorting involves arranging the data in either a DataFrame or a Series based on one or more columns or indices. This is particularly useful for organizing the data in ascending or descending order, making it easier to identify patterns or find specific values.

In Pandas, you can sort data using the **sort_values()** method for DataFrame or Series objects. This method allows you to specify one or more columns by which the data should be sorted. Additionally, you can control the sorting order (ascending or descending) using the ascending parameter.
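
As a minimal sketch of `sort_values()`, assuming a small made-up `toy` DataFrame: the first call sorts by one column in the default ascending order, and the second sorts by two columns with a separate direction for each.

```python
import pandas as pd

# Hypothetical toy data, not taken from the fraud dataset
toy = pd.DataFrame(
    {"is_fraud": [0, 1, 0, 1], "amt": [5.0, 20.0, 7.5, 3.0]}
)

# Single column, ascending by default
by_amt = toy.sort_values("amt")

# Two columns: is_fraud descending, then amt ascending within each group
by_fraud_then_amt = toy.sort_values(
    ["is_fraud", "amt"], ascending=[False, True]
)

print(by_amt)
print(by_fraud_then_amt)
```

`sort_values()` returns a new DataFrame and keeps the original row labels, which is why the index appears shuffled in the sorted output.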

- **Subsetting**

Subsetting involves selecting a subset of rows or columns from a DataFrame based on certain criteria. This allows you to focus on specific parts of the data that are relevant to your analysis or task.

In Pandas, you can subset data using various methods:

-  Selection by label: You can use **.loc[]** to select rows or columns by their labels.
-  Selection by position: You can use **.iloc[]** to select rows or columns by their positions (integer indices).
-  Boolean indexing: You can use boolean expressions to filter rows based on certain conditions.
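
The three subsetting methods listed above can be sketched as follows. The `toy` DataFrame and the threshold of 50 are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from the fraud dataset
toy = pd.DataFrame(
    {
        "category": ["travel", "kids_pets", "travel"],
        "amt": [7.99, 86.88, 120.0],
        "is_fraud": [0, 0, 1],
    }
)

# Selection by label: .loc slices include the end label, so rows 0 and 1
by_label = toy.loc[0:1, ["category", "amt"]]

# Selection by position: .iloc slices exclude the end position
by_position = toy.iloc[0:2, 0:2]

# Boolean indexing: combine conditions with & and wrap each in parentheses
large_travel = toy[(toy["category"] == "travel") & (toy["amt"] > 50)]

print(by_label)
print(by_position)
print(large_travel)
```

The difference in slice endpoints between `.loc[]` (inclusive) and `.iloc[]` (exclusive) is a common source of off-by-one mistakes.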




#### Sorting

In [27]:
# Sort the dataset by is_fraud, then by dob (both descending)
# Follow up - Create a plot that shows the relationship between the date of birth and fraudulent activities

# Since the dataset is huge, we only look at the first five records

fraud_by_age = data.sort_values(["is_fraud", "dob"], ascending=False)

print(fraud_by_age.head())

        Unnamed: 0 trans_date_trans_time           cc_num  \
273050      273050   2020-09-30 07:18:02  180020605265701   
273950      273950   2020-09-30 20:41:56  180020605265701   
274054      274054   2020-09-30 22:10:03  180020605265701   
274086      274086   2020-09-30 22:34:21  180020605265701   
274143      274143   2020-09-30 23:17:10  180020605265701   

                                 merchant       category      amt first  \
273050      fraud_Reilly, Heaney and Cole  gas_transport    11.50  John   
273950                    fraud_Price Inc   shopping_net  1017.05  John   
274054    fraud_Labadie, Treutel and Bode   shopping_net   943.12  John   
274086                   fraud_Rempel Inc   shopping_net   752.02  John   
274143  fraud_Larson, Quitzon and Spencer         travel    10.03  John   

         last gender              street  ...      lat     long  city_pop  \
273050  Lewis      M  7908 Derrick Mount  ...  39.8616 -97.1825       314   
273950  Lewis      M  7908 D

In [28]:
# Sort by is_fraud ascending, then by dob descending
# The second column breaks ties among rows that share the same is_fraud value
data_sorted = data.sort_values(["is_fraud","dob"], ascending=[True, False])
print(data_sorted.head())

      Unnamed: 0 trans_date_trans_time          cc_num  \
610          610   2020-06-21 15:40:05  36485887555770   
643          643   2020-06-21 15:49:29  36485887555770   
1033        1033   2020-06-21 18:02:40  36485887555770   
1166        1166   2020-06-21 18:53:46  36485887555770   
3084        3084   2020-06-22 08:16:53  36485887555770   

                                 merchant        category     amt    first  \
610    fraud_Turcotte, Batz and Buckridge  health_fitness    2.30  Michael   
643                   fraud_Dickinson Ltd        misc_pos   30.91  Michael   
1033    fraud_Turcotte, McKenzie and Koss   entertainment   72.70  Michael   
1166                  fraud_Dickinson Ltd        misc_pos  163.09  Michael   
3084  fraud_Greenholt, Jacobi and Gleason   gas_transport   53.97  Michael   

       last gender                    street  ...      lat     long  city_pop  \
610   Gross      M  230 Ryan Tunnel Apt. 025  ...  40.4971 -82.8342       267   
643   Gross      M  

In [29]:
data_sort_column = data.sort_values(["trans_date_trans_time"],ascending=False)

print(data_sort_column.head(3))

        Unnamed: 0 trans_date_trans_time            cc_num  \
555718      555718   2020-12-31 23:59:34  4170689372027579   
555717      555717   2020-12-31 23:59:24     4079773899158   
555716      555716   2020-12-31 23:59:15  6011724471098086   

                     merchant       category    amt   first     last gender  \
555718      fraud_Dare-Marvin  entertainment  38.13  Samuel     Frey      M   
555717  fraud_Breitenberg LLC         travel   7.99    Eric  Preston      M   
555716        fraud_Rau-Robel      kids_pets  86.88     Ann   Lawson      F   

                            street  ...      lat      long  city_pop  \
555718    830 Myers Plaza Apt. 384  ...  35.6665  -97.4798    116001   
555717  7020 Doyle Stream Apt. 951  ...  44.6255 -116.4493       129   
555716  144 Evans Islands Apt. 683  ...  46.1966 -118.9017      3684   

                 job         dob                         trans_num  \
555718   Media buyer  1993-05-10  1765bb45b3aa3224b4cdcb6e7a96cee3   
55571

##### Observation
As you may have noticed, sorting alone does not always help when you want to look closely at only a few columns, because the sorted result still contains every column.

That is where **subsetting** comes in handy: it returns only the columns you reference.
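
A minimal sketch of combining the two ideas, assuming a made-up `toy` DataFrame that borrows a few column names from the dataset: sort first, then keep only the columns of interest so the output stays readable.

```python
import pandas as pd

# Hypothetical toy data reusing some column names from the dataset
toy = pd.DataFrame(
    {
        "merchant": ["fraud_Kirlin and Sons", "fraud_Sporer-Keebler", "fraud_Rau-Robel"],
        "dob": ["1968-03-19", "1990-01-17", "1970-10-21"],
        "amt": [2.86, 29.84, 86.88],
        "is_fraud": [0, 1, 0],
    }
)

# Sort by both columns descending, then select only those two columns
sorted_subset = toy.sort_values(["is_fraud", "dob"], ascending=False)[
    ["is_fraud", "dob"]
]

print(sorted_subset)
```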

#### Subsetting