# CRISP-DM Method (Cross-Industry Standard Process for Data Mining)
The following mnemonic is what I use to remember the order of the steps in the CRISP-DM method:
- Barry - Business Understanding
- Drove - Data Understanding
- Directly to the - Data Preparation
- Medical - Modelling
- Emergency - Evaluation
- Department - Deployment

## 1- Business Understanding

**Scenario**
- The business has approached us with some data
- They want us to use the data to __forecast transactions__
- The data contains a list of accounting transactions
- Data contains multiple revenue streams
- They want us to make an __app that predicts future transactions__
- Data provided is for __3 years__
- We were advised that data __quality__ is __okay__

## 2- Data Understanding

- Let's import the pandas library and evaluate the quality of the data at hand

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('regression.csv')

In [4]:
# Let's take a look at the first 5 rows of the data
df.head()

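|   | Year | Month | Cost Centre | Account | Account Description | Account Type | Amount |
|---|------|-------|-------------|---------|---------------------|--------------|--------|
| 0 | 2019 | Jan | CC100 | 1000000 | Product Sales | Revenue | 1344.051 |
| 1 | 2019 | Jan | CC100 | 1000001 | Licensing Revenue | Revenue | 480.968 |
| 2 | 2019 | Jan | CC100 | 1000002 | Service Revenue | Revenue | 650.82 |
| 3 | 2019 | Jan | CC100 | 1000004 | Fee Revenue | Revenue | 339.36 |
| 4 | 2019 | Jan | CC100 | 2000000 | Cost of Good Sold | Expense | 1125.328 |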


In [5]:
# Let's take a look at the last 5 rows of the data
df.tail()

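|      | Year | Month | Cost Centre | Account | Account Description | Account Type | Amount |
|------|------|-------|-------------|---------|---------------------|--------------|--------|
| 4207 | 2021 | Dec | CC302 | 2000005 | Purchases | Expense | 698.121 |
| 4208 | 2021 | Dec | CC302 | 3000000 | Cash at Bank | Asset | -282.056 |
| 4209 | 2021 | Dec | CC302 | 3000001 | Inventory | Asset | 537.478 |
| 4210 | 2021 | Dec | CC302 | 3000002 | Accounts Receivable | Asset | 1152.68 |
| 4211 | 2021 | Dec | CC302 | 4000001 | Accounts Payable | Liability | -1020.0 |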


In [7]:
# Let's take a look at the info for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4212 entries, 0 to 4211
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year                 4212 non-null   int64  
 1   Month                4212 non-null   object 
 2   Cost Centre          4212 non-null   object 
 3   Account              4212 non-null   int64  
 4   Account Description  4212 non-null   object 
 5   Account Type         4212 non-null   object 
 6   Amount               4212 non-null   float64
dtypes: float64(1), int64(2), object(4)
memory usage: 230.5+ KB


___
### **Notes**
- Looking at the head and tail of the data, we can see that it covers **2019-2021** (3 years of data).
- There are **7 columns** in total.
- There are **4 categorical variables** (Month, Cost Centre, Account Description, Account Type).
- There are **3 numerical variables** (Year, Account, Amount), although Account holds account codes, so it behaves more like a category than a quantity.
- Since the goal is to forecast transactions, **"Amount"** is our **dependent variable** aka **target variable**, and the other variables are **independent variables** aka **features**.
- Another important point is that there are no missing values in the data (phew, that saves a lot of time).
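
These notes can also be checked in code rather than read off the `df.info()` output. The following is a minimal sketch, assuming a small stand-in frame (the name `sample` is mine, built from a few of the rows displayed above) in place of the full `df`:

```python
import pandas as pd

# Stand-in for df: a few rows copied from the head()/tail() output above.
sample = pd.DataFrame({
    "Year": [2019, 2019, 2021, 2021],
    "Month": ["Jan", "Jan", "Dec", "Dec"],
    "Cost Centre": ["CC100", "CC100", "CC302", "CC302"],
    "Account": [1000000, 2000000, 3000000, 4000001],
    "Account Description": ["Product Sales", "Cost of Good Sold",
                            "Cash at Bank", "Accounts Payable"],
    "Account Type": ["Revenue", "Expense", "Asset", "Liability"],
    "Amount": [1344.051, 1125.328, -282.056, -1020.0],
})

# Count missing values per column; all zeros means nothing to impute.
missing = sample.isna().sum()
print(missing)

# Split the columns by dtype: numeric versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# First and last year present in the data.
year_range = (int(sample["Year"].min()), int(sample["Year"].max()))
print("years:", year_range)
```

Running the same calls on the real `df` should agree with the notes above, but the split is only by dtype: it will still list Account as numeric even though it is an account code.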

In [10]:
# Let's look for unique values in each column to see how many categories each column has
# Let's get the names of the columns first
df.columns

Index(['Year', 'Month', 'Cost Centre', 'Account', 'Account Description',
       'Account Type', 'Amount'],
      dtype='object')

In [15]:
# Let's look for unique values in one of the columns first
len(df['Account'].unique())

13

In [16]:
# As you can see, there are 13 unique values in total in the Account column
# Now, instead of doing this manually for every column, let's iterate through every column to find its unique categories
for col in df.columns:
    print(col, len(df[col].unique()), df[col].unique())

Year 3 [2019 2020 2021]
Month 12 ['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']
Cost Centre 9 ['CC100' 'CC101' 'CC102' 'CC200' 'CC201' 'CC202' 'CC300' 'CC301' 'CC302']
Account 13 [1000000 1000001 1000002 1000004 2000000 2000001 2000002 2000003 2000005
 3000000 3000001 3000002 4000001]
Account Description 13 ['Product Sales' 'Licensing Revenue' 'Service Revenue' 'Fee Revenue'
 'Cost of Good Sold' 'Staff Expenses' 'Technology Expenses'
 'Property Expenses' 'Purchases' 'Cash at Bank' 'Inventory'
 'Accounts Receivable' 'Accounts Payable']
Account Type 4 ['Revenue' 'Expense' 'Asset' 'Liability']
Amount 3956 [1344.051  480.968  650.82  ... -282.056  537.478 1152.68 ]


___
### **Notes**
- **Year** column has **3** unique categories
- **Month** column has **12** unique categories
- **Cost Centre** column has **9** unique categories
- **Account** column has **13** unique categories
- **Account Description** column has **13** unique categories
- **Account Type** column has **4** unique categories
- **Amount** column has **3956** unique values
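
If only the counts are needed, pandas can produce them in one call with `DataFrame.nunique()`. A minimal sketch, again assuming a small stand-in frame (`sample`, my own name, built from rows shown in the `head()` and `tail()` output) rather than the full `df`:

```python
import pandas as pd

# Stand-in for df: rows 0, 1, 4, 4208 and 4211 from the output shown earlier.
sample = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2021, 2021],
    "Month": ["Jan", "Jan", "Jan", "Dec", "Dec"],
    "Account Type": ["Revenue", "Revenue", "Expense", "Asset", "Liability"],
    "Amount": [1344.051, 480.968, 1125.328, -282.056, -1020.0],
})

# One row per column, holding the number of distinct values in that column.
unique_counts = sample.nunique()
print(unique_counts)
```

Note that `nunique()` ignores missing values by default, while `len(df[col].unique())` counts `NaN` as a value. With no missing values in this data, the two approaches give the same numbers.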
___

### **Observation**
- The **Month** column can be converted into a numerical column very easily
- The **Account** and **Account Description** columns seem to carry the same information (both have 13 unique values), so we may be able to drop one of them, but we need further analysis to make sure we don't drop a column that could influence our model.
- The **Account Type** column can be one-hot encoded very easily as well
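
The three observations above could be explored along these lines. This is only a sketch under stated assumptions: `sample` is a small stand-in for `df` built from rows shown earlier, the `Month Num` column name is a hypothetical choice of mine, and `pd.get_dummies` is used as one convenient way to one-hot encode, not necessarily the encoder used later for modelling.

```python
import pandas as pd

# Stand-in for df: a few rows copied from the head()/tail() output above.
sample = pd.DataFrame({
    "Month": ["Jan", "Jan", "Jan", "Dec", "Dec"],
    "Account": [1000000, 1000001, 2000000, 3000000, 4000001],
    "Account Description": ["Product Sales", "Licensing Revenue",
                            "Cost of Good Sold", "Cash at Bank",
                            "Accounts Payable"],
    "Account Type": ["Revenue", "Revenue", "Expense", "Asset", "Liability"],
})

# 1) Convert month abbreviations to month numbers (Jan -> 1, ..., Dec -> 12).
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_num = {name: i for i, name in enumerate(month_order, start=1)}
sample["Month Num"] = sample["Month"].map(month_to_num)  # hypothetical column name

# 2) Check whether Account and Account Description map one-to-one:
#    every account has exactly one description, and vice versa.
desc_per_account = sample.groupby("Account")["Account Description"].nunique()
account_per_desc = sample.groupby("Account Description")["Account"].nunique()
one_to_one = bool((desc_per_account == 1).all() and (account_per_desc == 1).all())
print("one-to-one:", one_to_one)

# 3) One-hot encode Account Type into one indicator column per category.
encoded = pd.get_dummies(sample, columns=["Account Type"])
print(encoded.columns.tolist())
```

If the one-to-one check also holds on the full `df`, the two account columns are redundant with each other and one of them is a candidate for dropping.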
