## Questions to take into account:

### What is "data"?

You can go to Wikipedia or read books to get an answer to this question, but most of those sources will give you a very pedantic, unintuitive definition. Instead, we're going to go with the colloquial definition of data as **"something whose value you care about"**. You won't find that in any formal treatment of the subject, but for now, it is good enough. Your name, age, and telephone number are data about you. Your bank savings, your address, and your parents' names are data that relate to you. We have data about everything, everywhere.

## What is data science? (10 mts)

Now that we know what data is, we can ask: "What is data science?" Science, in the language of the scientific method, is:

1. Formulating hypotheses, or guesses about how the world works, based on observations of the world around us
2. Validating or invalidating those hypotheses by conducting experiments

Unlike in the pure sciences, though, working with data doesn't necessarily require conducting experiments (although it could!). Rather, the data has often already been collected and organized by someone else. So the scientific method, as applied to data, can be summarized as: **"Formulating hypotheses based on the world around us, then analyzing relevant data to validate or invalidate our hypotheses."**

## What is machine learning?

However, choice (d) of Exercise 1, as well as examples 1 and 2 from the above list, do fall into the bucket of **machine learning**. What does this mean?

"Learn" means to "gain or acquire knowledge or skill in something via experience." So one could frame "machine learning" as "how a machine gains or acquires knowledge via experience." How does a machine gain experience? All machine inputs are essentially binary strings of 0s and 1s, which is really just – you guessed it – data! So machine learning is really just **how a computer acquires knowledge via data**.

Of course, this gives no insight into the "how" at all; it just says that something is done with input data to generate this knowledge as an output. To make a math analogy, machine learning is some function $f$ such that

> $\text{knowledge} = f(\text{data})$

and other than that there are no other real stipulations on $f$! So $f$ could be as mechanical as a simple mathematical function (say, the sum of all the data points) and qualify as machine learning. And in practice, this is what most of the common machine learning algorithms are, including:

1. Logistic regression
2. Random forests
3. Support vector machines
4. $k$-means clustering
5. Neural networks

(You will learn about all of these later in the course.) This may seem disappointing, given how the media hypes up "artificial intelligence" and makes it seem like there is something "smart" going on with machine learning, but in fact many mechanical methods satisfy the conditions required to be classified as machine learning. This doesn't mean these mechanical methods are limited in usefulness – in fact, they are quite powerful if used properly – but it does mean that they don't resemble anything that we would naturally associate with human-like intelligence.
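To make the point concrete, here is a minimal sketch in which $f$ is nothing more than the sum mentioned above. The function name `f` mirrors the notation in the text, and the sample values are made up purely for illustration.

```python
# A deliberately "mechanical" f: the knowledge it extracts is just the sum of the data.
# The sample values below are invented for illustration only.

def f(data):
    """Map raw data points to a single piece of 'knowledge': their total."""
    return sum(data)

monthly_expenses = [20, 30, 40]
knowledge = f(monthly_expenses)
print(knowledge)  # 90
```

Nothing here looks "smart", yet it fits the loose definition: data goes in, a summary of that data comes out.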

## What is artificial intelligence? 

But the elephant is still in the room: even though some mechanical, "dumb" methods qualify as machine learning, nothing excludes human-like, "smart" methods from being classified as machine learning too. Semantically, that is correct, yet people have chosen to give those methods a different name entirely: **artificial intelligence**.

But why? Why give "smart" methods an entirely different name if they can also fall under the bucket of machine learning? That is the question we will explore for the remainder of this module.

Let's start by taking a look at an iconic demonstration of this so-called intelligence: AlphaGo beating the world's top human Go player.

### What can machines do and not do? What about us? 

So, is there any sensible test we could use to determine if something is as intelligent as a human? There have been many proposals over time. The most famous is the **Turing test**, named after the English mathematician and World War II cryptographer Alan Turing. In this test, there is a human evaluator and two conversation partners: one machine and one human. The evaluator conducts a conversation with each through a text-only channel. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.

Turing did not explicitly state that his test could be used as a measure of intelligence, but nonetheless many who came after him thrust his test into the limelight. The assumption behind using it this way is that if a computer can converse like a human, then it is effectively as intelligent as a human.

## Conclusions & Takeaways

In this discussion, you learned what "data science" and "machine learning" really are, in contrast to the misleading connotations that they are often given in public discussion. You also learned that "artificial intelligence" is a very murky term: nobody really knows what it is, and it is unclear whether its current focus on imitating human intelligence is even a good use of our time and effort.

Throughout this program, we will focus primarily on data science and machine learning, not so much artificial intelligence. Yet the philosophical questions surrounding artificial intelligence are fascinating, and we encourage you to continue pondering them as you become more involved in this new and exciting field.

# Setting up Colab and getting data from a path in MyDrive (some of this was modified for security reasons)

In [2]:
import os
base_dir = "~/My_projects/Budget_program/"
print(base_dir)

~/My_projects/Budget_program/
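One caveat with a path like the one above: Python's file functions, such as `os.listdir`, do not expand `~` on their own. The following is a minimal sketch of expanding it first with `os.path.expanduser`; the variable names `expanded_dir` and `expanded_data_path` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
import os

base_dir = "~/My_projects/Budget_program/"

# os.path.expanduser replaces a leading "~" with the current user's home directory.
expanded_dir = os.path.expanduser(base_dir)

# os.path.join builds child paths without worrying about trailing slashes.
expanded_data_path = os.path.join(expanded_dir, "project", "data")
print(expanded_data_path)
```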


## Notebook Objective



**Business Context.** 

**Business Problem.** 




**Analytical Context.** The sources to be used are:

*   Transaction files from bank X

By the end of this exercise, we want a solution capable of:

1.   Receiving the organized sources
2.   Generating a minable view (an analysis-ready table)
3.   Producing, as a result, a descriptive analysis of the transactions
4.   Producing, as a result, a segmentation analysis of the expenses

**Notes:**

-To be filled in

## Supporting links:

### Part 1:

[Introduction to Python](https://developers.google.com/edu/python/introduction)

[Using paths](https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory)

[Introduction to regex:GLink1](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)

[Introduction to regex:GLink2](https://developers.google.com/edu/python/regular-expressions)

[Regex:GLink3](https://realpython.com/regex-python/)

[Regex:GLink4](https://www.dataquest.io/blog/regex-cheatsheet/)

[Regex Tester](https://regex101.com/)

### Part 2:

[Built In Format in Python](https://docs.python.org/release/3.1.3/library/stdtypes.html#str.format)

### Part 3:

[Support of EDA](https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e)

[Support of missing data](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b)

[Support of visualization](https://towardsdatascience.com/5-advanced-visualisation-for-exploratory-data-analysis-eda-c8eafeb0b8cb)

## Setting up modules to process data

### Variables in dataset (Dictionary of variables)

The names of the variables are the following:

*   FECHA: Date on which the transaction was made
*   DOCUMENTO: Name related to internal processes in the bank
*   OFICINA: Name of the office or place from which the transaction was made
*   DESCRIPCIÓN: Description of the transaction
*   REFERENCIA: Number or description of the entity that the transaction paid or received money from
*   VALOR: Value of the transaction
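As a sketch, this dictionary can double as a sanity check on each loaded file. The names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helpers, not part of the original notebook, and the column names are spelled as they appear in the files' headers. With a DataFrame, one would pass `df.columns`.

```python
# Expected column names, spelled as in the files' headers.
EXPECTED_COLUMNS = ["FECHA", "DOCUMENTO", "OFICINA", "DESCRIPCIÓN", "REFERENCIA", "VALOR"]

def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`, in order."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Example with a plain list; with pandas this would be missing_columns(df.columns).
print(missing_columns(["FECHA", "OFICINA", "VALOR"]))
```

An empty list means every expected column is present, so the check can be used as a simple guard before any further processing.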



In [3]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport  # note: the package was later renamed to ydata-profiling
from datetime import datetime

# Processing files
import re

In [4]:
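# Functions to help with tasks
def getting_colnames_asrows(df):
  """Print each column name of df on its own line."""
  for colname in df.columns:
    print(colname)

def getting_names_asrows(items):
  """Print each item quoted and followed by a comma, ready to paste into a list."""
  # The parameter was renamed from `list` to avoid shadowing the built-in.
  for item in items:
    print("\""+item+"\""+",")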

In [5]:
# Step1: Get path of data
# Step2.0: Walk in directory
# Step2.1: Check encoding of .xlsx, .csv
# Step3.0: Generating dataframe in dict
# Step3.1: Check to parse important variable
# Step4: Make loading function
# Step5: Getting dataframe for pandas profiling
data_path = base_dir + "project/data/"


In [7]:
# Generating dict of dataframes, keyed by filename
dictFiles = {}
files = os.listdir(data_path)
for filename in files:
  # Keep only files named like "<account>_<Mon>_<YYYY>", e.g. "..._Ene_2021.xlsx".
  # A raw string avoids invalid-escape warnings for "\w".
  match = re.search(r"_(\w{3})_\w{4}", str(filename))
  if match:
    print(filename)
    #print(match.group())

    filepath = os.path.join(data_path, filename)
    #wb = xlrd.open_workbook(filepath, encoding_override='CORRECT_ENCODING')
    #df = pd.read_excel(wb)

    print(filepath)
    dfTemp = pd.read_excel(filepath)
    dictFiles[filename] = dfTemp

640-212335-18_Ene_2021.xlsx
/content/gdrive/My Drive/Colab Notebooks/My_projects/Budget_program/project/data/640-212335-18_Ene_2021.xlsx
640-212335-18_Feb_2021.xlsx
/content/gdrive/My Drive/Colab Notebooks/My_projects/Budget_program/project/data/640-212335-18_Feb_2021.xlsx
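The pattern in the cell above keeps only files whose names contain a three-character month and a four-character year. As a sketch, the same idea can also pull those two pieces out of the filename with named groups. The helper name `parse_month_year` is an assumption for illustration, and it uses `\d{4}` for the year, which is stricter than the `\w{4}` used in the cell above.

```python
import re

# Matches "_<3 word chars>_<4 digits>", e.g. "_Ene_2021" in "640-212335-18_Ene_2021.xlsx".
FILENAME_PATTERN = re.compile(r"_(?P<month>\w{3})_(?P<year>\d{4})")

def parse_month_year(filename):
    """Return (month, year) extracted from filename, or None if it does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return None
    return match.group("month"), int(match.group("year"))

print(parse_month_year("640-212335-18_Ene_2021.xlsx"))  # ('Ene', 2021)
print(parse_month_year("notes.txt"))                    # None
```

Keying the dictionary of DataFrames by `(month, year)` instead of the full filename would be one possible use, though the notebook itself keys by filename.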


In [15]:
# Loop over the dict of DataFrames to check column names, shapes, dtypes, and unique counts
for key_month in dictFiles:
  print("\nChecking %s" % key_month )
  getting_colnames_asrows(dictFiles[key_month])
  print(dictFiles[key_month].shape)
  print(dictFiles[key_month].dtypes)
  #print(dictFiles[key_month].head())
  print(dictFiles[key_month].nunique(axis=0)) #To check uniques


Checking 640-212335-18_Ene_2021.xlsx
FECHA
DOCUMENTO
OFICINA
DESCRIPCIÓN
REFERENCIA
VALOR
(36, 6)
FECHA          datetime64[ns]
DOCUMENTO             float64
OFICINA                object
DESCRIPCIÓN            object
REFERENCIA             object
VALOR                 float64
dtype: object
FECHA          13
DOCUMENTO       0
OFICINA         5
DESCRIPCIÓN    16
REFERENCIA      4
VALOR          35
dtype: int64

Checking 640-212335-18_Feb_2021.xlsx
FECHA
DOCUMENTO
OFICINA
DESCRIPCIÓN
REFERENCIA
VALOR
(23, 6)
FECHA          datetime64[ns]
DOCUMENTO             float64
OFICINA                object
DESCRIPCIÓN            object
REFERENCIA             object
VALOR                 float64
dtype: object
FECHA          11
DOCUMENTO       0
OFICINA         2
DESCRIPCIÓN    10
REFERENCIA      5
VALOR          23
dtype: int64


## Getting Pandas Profiling Report with loop

In [9]:
#!pip install -U https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [12]:
def making_report_from_dict(dictFiles,base_dir):
  """Write one minimal pandas-profiling HTML report per DataFrame in dictFiles."""
  for key_month in dictFiles:
    # Spaces added around "with size" so the filename and row count do not run together
    print("\nProcessing %s with size %d" % (key_month, dictFiles[key_month].shape[0]))
    print("Initial path is %s" % base_dir)
    reportname = str(key_month)
    profile = ProfileReport(dictFiles[key_month],title="Pandas Profiling Report",minimal=True)
    profile.to_file(base_dir+"project/html_reports/raw_%s_full.html" % (reportname))

In [13]:
making_report_from_dict(dictFiles=dictFiles,base_dir=base_dir)


Processing 640-212335-18_Ene_2021.xlsx with size 36
Initial path is /content/gdrive/My Drive/Colab Notebooks/My_projects/Budget_program/

Processing 640-212335-18_Feb_2021.xlsx with size 23
Initial path is /content/gdrive/My Drive/Colab Notebooks/My_projects/Budget_program/

## Temporary Tests 

In [113]:
# Using listdir to see the list of files before applying the regex
files = os.listdir(data_path)
for filename in files:
  print(filename)

640-212335-18_Ene_2021.xlsx
640-212335-18_Feb_2021.xlsx
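As an alternative sketch, `pathlib.Path.glob` can list and filter by extension in one step. The directory and file names below are made up and created in a temporary folder so the example runs anywhere; they are not the notebook's real data.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with made-up file names so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["acct_Ene_2021.xlsx", "acct_Feb_2021.xlsx", "readme.txt"]:
        (folder / name).touch()

    # glob("*.xlsx") yields only the Excel files; sorted() makes the order stable.
    excel_files = sorted(path.name for path in folder.glob("*.xlsx"))

print(excel_files)  # ['acct_Ene_2021.xlsx', 'acct_Feb_2021.xlsx']
```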


In [None]:
import xlrd
book = xlrd.open_workbook("myfile.xls")
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
    print(sh.row(rx))