![IT%20Logo.png](attachment:IT%20Logo.png)

<div class="alert alert-block alert-info"><font size="5"><center><b><u>Introduction to Variables</u></b></center></font>

\begin{align*}Alex\:Kumenius\end{align*}
\begin{align*}Business\hspace{2mm}Intelligence\hspace{2mm}and\hspace{2mm}Data\hspace{2mm}Scientist\hspace{2mm}Project\hspace{2mm}Integrator\end{align*}
$%$       
\begin{align*}Date : January\hspace{2mm}2021\end{align*}</div>

# <span style=color:darkblue>TYPES OF VARIABLES</span>

In mathematics and statistics, <span style=color:blue>variables</span> are generally either <span style=color:blue>Numerical</span> or <span style=color:blue>Categorical Variables</span>.

A <span style=color:blue>Variable</span> is a <u>quantity whose value changes</u>. 

![Types%20of%20Variables.jpg](attachment:Types%20of%20Variables.jpg)

### <span style=color:darkred>NUMERICAL</span>

A <span style=color:blue>Numerical variable</span> can take a wide range of numerical values, and it is sensible to <span style=color:blue><em>add, subtract</em></span>, or take <span style=color:blue><em>averages</em></span> with those <span style=color:blue>values</span>. On the other hand, we would <b><u>not</u></b> classify a variable reporting <em>"<u>telephone area codes</u>"</em> as <span style=color:blue>numerical</span>, since it makes **no** sense to take the *average, sum*, or *difference* of area codes.

#### <span style=color:gray><u>Discrete Variables</u></span>

A <span style=color:blue><b><u>Discrete variable</u></b></span> is a <span style=color:blue>variable whose value is obtained by <u>counting</u>.</span> 

A variable is discrete over a particular range of real values ($\:\mathbb {R}\:$) if, for any value in the range that the variable is permitted to take on, there is a positive minimum distance to the nearest other permissible value. The number of permitted values is either <span style=color:blue><b><em><u>finite or countably infinite</u></em></b></span>. 

Common examples are variables that must be <span style=color:blue><em>integers, non-negative integers, positive integers</em></span>, or <span style=color:blue><em>only the integers 0 and 1</em></span>.

<u><b>Examples:</b></u> 
- number of students present
- number of red marbles in a jar
- number of heads when flipping three coins
- students’ grade level
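The coin example above can be worked through by hand or, as a minimal sketch, by enumeration. The snippet below is only an illustration (the variable names are made up for it): it lists every outcome of flipping three coins and counts the heads, showing that the variable can only take a finite set of separated values.

```python
from itertools import product
from collections import Counter

# Enumerate all 2**3 equally likely outcomes of flipping three coins.
outcomes = list(product("HT", repeat=3))

# The variable "number of heads" is obtained by counting.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# The permitted values form a finite set with a gap of 1 between neighbours.
permitted_values = sorted(heads_counts)
print(permitted_values)
print(dict(sorted(heads_counts.items())))
```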

#### <span style=color:gray><u>Continuous Variables</u></span>

A <span style=color:blue><b><u>Continuous variable</u></b></span> is a <span style=color:blue>variable whose value is obtained by measuring.</span>

A <span style=color:blue><b><u>Continuous variable</u></b></span> is one which can take on <span style=color:blue>uncountably many values</span>.

For example, a variable over a non-empty range of the real numbers ($\:\mathbb {R}\:$) between ***a*** and ***b*** is <span style=color:blue><b><u>continuous</u></b></span> if it can take on <span style=color:blue>*any value in that range*</span>. The reason is that any range of real numbers between ***a*** and ***b***, with $a, b \in \mathbb{R}$ and $a \neq b$, is infinite and uncountable.

<u><b>Examples:</b></u> 
- height of students in class
- weight of students in class
- time it takes to get to school
- distance traveled between classes

### <span style=color:darkred>CATEGORICAL</span>

A <span style=color:blue><b><u>Categorical Variable</u></b></span> takes on a limited, and usually fixed, number of possible values (categories); the possible values are called the <span style=color:blue><em><u>variable's <b>levels</b>.</u></em></span>

<span style=color:blue>Categorical Variables</span> whose <span style=color:blue>levels</span> have a natural order are called <span style=color:blue><b><u>"Ordinal Variables"</u></b></span>. 

<span style=color:blue>Categorical Variables</span> <span style=color:red><b>without</b></span> this type of special ordering are called <span style=color:blue><b><u>"Nominal Variables"</u></b></span>.

Examples of categorical variables are *gender, social class, blood type, country affiliation, observation time or rating via Likert scales*.   

When pandas converts data to the ``category`` dtype with its default behavior:

1- The ``categories`` are <span style=color:blue><ins> deduced from the data </ins> </span>.   
2- The ``categories`` are <span style=color:blue><ins> unordered </ins> </span>.   

<span style=color:blue>Categorical data</span> might have an order (e.g. <em>‘strongly agree’ vs ‘agree’</em> or <em>‘first observation’ vs. ‘second observation’</em>), but **numerical operations** (additions, divisions, …) are not possible.
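As a minimal sketch of this point, the snippet below builds an ordered categorical from made-up Likert answers (the data and the names `likert` and `answers` are assumptions for illustration only). Order-based operations such as `min`, `max` and comparisons work, while an arithmetic summary such as the mean is rejected.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical Likert-scale answers, used only for illustration.
likert = CategoricalDtype(
    categories=["disagree", "neutral", "agree", "strongly agree"],
    ordered=True,
)
answers = pd.Series(["agree", "strongly agree", "neutral", "agree"], dtype=likert)

# Order-based operations are defined for an ordered categorical.
lowest, highest = answers.min(), answers.max()
above_neutral = (answers > "neutral").tolist()

# Arithmetic summaries are not defined and raise a TypeError.
try:
    answers.mean()
    mean_supported = True
except TypeError:
    mean_supported = False

print(lowest, highest, above_neutral, mean_supported)
```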

#### <span style=color:blue>Categorical - "Nominal Variables"</span>

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df

In [None]:
type(df)

In [None]:
df.dtypes

In [None]:
# astype('category') uses the default behavior: categories are inferred from the data and unordered
df["B"] = df["A"].astype('category')
df

In [None]:
type(df)

In [None]:
df.dtypes

In [None]:
df['A']

In [None]:
df['B']

#### <span style=color:blue>Categorical - "Ordinal Variables"</span>

How do we <span style=color:blue><b>control the behavior</b></span> of a <span style=color:blue>Categorical Variable</span>, for example to give its levels a logical order?

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
s = pd.Series(["Wednesday", "Monday", "Thursday", "Sunday", "Friday"])
# plain strings sort in lexical (alphabetical) order
s.sort_values(inplace=True)
s

In [None]:
sc = pd.Series(["Wednesday", "Saturday", "Monday", "Sunday", "Thursday", "Tuesday", "Friday"], 
               dtype="category")
# unordered categories inferred from the data also sort lexically
sc.sort_values(inplace=True)
sc

In [None]:
cat_s = CategoricalDtype(categories=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], 
                         ordered=True)
cat_s

In [None]:
s_cat = s.astype(cat_s)
s_cat.sort_values(inplace=True)
s_cat
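To see why the sort order changes, it can help to look at how an ordered categorical is stored. The sketch below is self-contained and uses its own names (`week`, `days`, `days_cat`), which are not part of the notebook: each value is kept as an integer code pointing into the category list, and sorting follows those codes.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

week = CategoricalDtype(
    categories=["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],
    ordered=True,
)
days = pd.Series(["Wednesday", "Monday", "Thursday", "Sunday", "Friday"])
days_cat = days.astype(week)

# Each value is stored as an integer code pointing into the category list.
codes = days_cat.cat.codes.tolist()

# Sorting follows the declared category order, not the alphabet.
lexical = days.sort_values().tolist()
logical = days_cat.sort_values().tolist()
print(codes)
print(lexical)
print(logical)
```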

The <span style=color:blue><b>categorical data</b></span> type is <b>useful</b> in the following ``cases``:

- <span style=color:blue><b>A string variable consisting of only a few different values</b></span>.   
<i><span style=color:blue>Converting</span> such a <span style=color:blue><b>string variable</b></span> to a <span style=color:blue><b>categorical variable</b></span> will save some <b>memory</b></i>.   
$%$
- <span style=color:blue><b>The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”)</b></span>.   
<i>By <span style=color:blue>converting</span> to a <span style=color:blue>categorical</span> and specifying an <span style=color:blue>order on the categories, <u>sorting</u> and <u>min/max</u></span> <b>will use</b> the <span style=color:blue><u>logical order</u></span> <span style=color:red><b>instead</b></span> of the <span style=color:blue><u>lexical order</u></span></i>.   
$%$
- <span style=color:blue><b>As a signal to other Python libraries that this column should be treated as a categorical variable</b></span>   
<i>e.g. to use suitable <span style=color:blue>statistical methods</span> or <span style=color:blue>plot types</span></i>.
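The first two cases can be checked with a short experiment. This is a rough sketch on made-up data (30,000 rows built from three strings); the exact byte counts depend on the pandas version and platform, so only the comparison between them is meaningful.

```python
import pandas as pd

# 30,000 rows built from only three distinct strings (illustrative data).
words = pd.Series(["one", "two", "three"] * 10_000)
words_cat = words.astype("category")

# deep=True also counts the memory held by the string values themselves.
bytes_as_strings = words.memory_usage(deep=True)
bytes_as_category = words_cat.memory_usage(deep=True)
print(bytes_as_strings, bytes_as_category)

# With an explicit order, min/max follow the logical order.
ordered = words.astype(pd.CategoricalDtype(["one", "two", "three"], ordered=True))
print(words.min(), ordered.min())
print(words.max(), ordered.max())
```

On the plain strings, `max()` returns `"two"` because it is last alphabetically; on the ordered categorical it returns `"three"`.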


# <span style=color:darkblue><ins>Detecting and Filtering Outliers</ins></span>

Filtering or transforming <span style=color:blue><b>Outliers</b></span> is largely a matter of applying ``array`` operations.

In [None]:
data = pd.DataFrame(np.random.randn(1000, 4))

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.describe()

We want to find, in one of the columns, <span style=color:blue><u>values</u> that <b>exceed</b> <b>3</b></span> in ``absolute value``.

In [None]:
col = data[1]

In [None]:
col[np.abs(col) > 3]

To select all the <span style=color:blue>observations (cases)</span> with values that <span style=color:red>exceed the <b>limits</b></span> ``3`` or ``-3``, we use the <span style=color:blue><b>any()</b></span> method on a Boolean DataFrame:  

In [None]:
# axis must be passed by keyword in recent pandas versions
data[(np.abs(data) > 3).any(axis=1)]
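Because the random data changes on every run, the row selection is easier to verify on a small hand-made frame. The sketch below uses invented values and names (`small`, `mask`) purely for illustration. Note that a value of exactly `-3.0` does not count as exceeding the limit.

```python
import numpy as np
import pandas as pd

# Small hand-made frame so the result can be checked by eye (illustrative values).
small = pd.DataFrame({
    "x": [0.5, -3.2, 1.0, 2.9],
    "y": [3.5, 0.1, -0.4, -3.0],
})

# Boolean DataFrame: True wherever |value| > 3.
mask = np.abs(small) > 3

# any(axis=1) collapses each row to a single True/False.
rows_with_outlier = mask.any(axis=1)
print(small[rows_with_outlier])
```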

<span style=color:red>Values can be set based on these criteria</span>. 

To cap the values to the interval ``-3`` to ``3``:

In [None]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [None]:
np.sign(data)

In [None]:
data.head()

In [None]:
# check 'min' and 'max' in the statistical summary:
# they no longer fall outside the interval -3 to 3
data.describe()

The expression np.sign(data) produces the values ``1`` and ``-1``, depending on whether the values in <span style=color:blue>data</span> are positive or negative:

In [None]:
np.sign(data).head()

In [None]:
(np.sign(data) * 3).head()
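The capping step can be compared against pandas' built-in `DataFrame.clip`, which should give the same result for a symmetric interval. The following is a small sketch on invented values (`values`, `capped` and `clipped` are names chosen here, not part of the notebook).

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({"x": [0.5, -3.2, 4.1], "y": [3.5, 0.1, -0.4]})

# Approach used above: overwrite out-of-range cells with sign * 3.
capped = values.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Built-in alternative: clip every value to the interval [-3, 3].
clipped = values.clip(lower=-3, upper=3)

print(capped)
print(capped.equals(clipped))
```

The `np.sign` version makes the mechanism explicit, while `clip` states the intent in one call and also handles asymmetric limits.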

# <span style=color:darkblue><u>Computing Indicator/Dummy Variables</u></span>

Another type of transformation for <i>Statistical Modeling</i> or <i>Machine Learning</i> applications is converting a <span style=color:blue>Categorical</span> variable into a <span style=color:blue><b>"dummy"</b> or <b>"indicator"</b></span> matrix. 

If a column in a DataFrame has <b>k</b> distinct values, we can derive a matrix or DataFrame with <b>k</b> columns containing only <span style=color:blue><b>1s</b> and <b>0s</b></span>. pandas has a <span style=color:blue><b>get_dummies()</b></span> function for doing this:

## <span style=color:blue>pd.get_dummies() function</span>

In [None]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

In [None]:
type(df)

In [None]:
df.shape

In [None]:
df_dummies = pd.get_dummies(df['key'])
df_dummies

In [None]:
df_dummies.shape
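One detail worth knowing: in recent pandas versions (2.0 and later, as far as I am aware) `get_dummies()` returns boolean `True`/`False` columns by default rather than 1s and 0s. A minimal sketch, using a separate Series named `keys` for illustration, that requests integer indicators explicitly through the `dtype` argument:

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c", "a", "b"], name="key")

# dtype=int forces 0/1 integers regardless of the pandas version's default.
indicators = pd.get_dummies(keys, dtype=int)

# k distinct values produce k columns, and each row has exactly one 1.
print(indicators.shape)
print(indicators.sum(axis=1).tolist())
```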

We can add a prefix to the columns in the dummy DataFrame, which can then be combined with the other data. <span style=color:blue>get_dummies()</span> has a <span style=color:darkred><b>prefix</b></span> argument for doing this:

In [None]:
dummies = pd.get_dummies(df['key'], prefix = 'cod')
dummies

In [None]:
dummies.shape

In [None]:
df_con_dummies = df[['data1']].join(dummies)
df_con_dummies

In [None]:
df_con_dummies.shape

In [None]:
os.getcwd()

In [None]:
os.chdir('D:\\Documents\\Python\\Python for Data Analysis-Pandas Jupyter Notebook\\pydata-Notebooks\\datasets\\movielens')

In [None]:
os.listdir()

In [None]:
mcabecera = ['movie_id', 'titulo', 'genero']
mcabecera

In [None]:
# a multi-character separator such as '::' needs the Python parsing engine
movies = pd.read_table('movies.dat', sep = '::', header = None, names = mcabecera, engine = 'python')
movies.head()

In [None]:
movies.shape

In [None]:
movies.describe()

Adding <span style=color:blue>dummy variables</span> for each genre requires a bit of transformation. 

First, we extract the list of unique genres in the dataset:

In [None]:
todos_generos = []
todos_generos

In [None]:
for x in movies.genero:
    todos_generos.extend(x.split('|'))

In [None]:
type(todos_generos)

In [None]:
todos_generos[:8]

In [None]:
len(todos_generos)

In [None]:
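# wrap the list in a Series: passing a plain list to pd.unique is deprecated in recent pandas
generos = pd.unique(pd.Series(todos_generos))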
generos

In [None]:
len(generos)

In [None]:
movies.head(10)

To build a <span style=color:blue>Dummy</span> DataFrame, we start by creating an array of <b>zeros</b> and then wrap it in a DataFrame of <b>zeros</b>:

In [None]:
len(movies)

In [None]:
cero_matriz = np.zeros((len(movies), len(generos)))
cero_matriz.shape

In [None]:
cero_matriz

In [None]:
cero_matriz.sum(axis=0)

In [None]:
len(cero_matriz.sum(axis=0))

In [None]:
dummies = pd.DataFrame(cero_matriz, columns = generos)
dummies.head()

In [None]:
dummies.sum()

In [None]:
dummies.describe()

Now, we <span style=color:blue><b>iterate</b> through each movie</span> and set the entries in each row of <span style=color:blue>dummies</span> to <b>1</b>. To do this, we use <span style=color:blue>dummies.columns</span> to compute the column indices for each genre:

In [None]:
dummies.columns

In [None]:
gen = movies.genero[0]
gen

In [None]:
gen.split('|')

In [None]:
dummies.columns.get_indexer(gen.split('|'))
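The behavior of `get_indexer` is easy to see on a tiny Index. The labels below are stand-ins chosen for illustration; the point is that each label maps to its integer position, and a label that is absent maps to `-1`.

```python
import pandas as pd

# Hypothetical column labels standing in for the genre names.
columns = pd.Index(["Animation", "Comedy", "Drama"])

# get_indexer returns the integer position of each requested label.
positions = columns.get_indexer(["Drama", "Animation"])

# Labels that are not in the Index are reported as -1.
with_missing = columns.get_indexer(["Comedy", "Western"])
print(positions.tolist(), with_missing.tolist())
```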

Now, we can use <span style=color:blue>.iloc</span> to set values based on these indices:

In [None]:
for i, gen in enumerate(movies.genero):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [None]:
movies.head()

In [None]:
dummies.head()

In [None]:
dummies.sum()

Finally, we can combine <b>'dummies'</b> with <b>'movies'</b>:

In [None]:
movies_dummies = movies.join(dummies.add_prefix('Genero_'))
movies_dummies.head()

In [None]:
movies_dummies.iloc[1]
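For delimiter-separated labels like these, pandas also offers `Series.str.get_dummies`, which may replace the explicit loop. The sketch below is an alternative, not the approach used above, and it runs on a tiny invented table (`sample`) rather than the MovieLens file.

```python
import pandas as pd

# Tiny stand-in for the movies table (titles and genres are illustrative).
sample = pd.DataFrame({
    "titulo": ["Movie A", "Movie B", "Movie C"],
    "genero": ["Animation|Comedy", "Drama", "Comedy|Drama"],
})

# Split each string on '|' and build one 0/1 column per distinct genre.
genre_dummies = sample["genero"].str.get_dummies(sep="|")
print(genre_dummies)

# Same prefixing and join as in the loop-based version.
sample_dummies = sample.join(genre_dummies.add_prefix("Genero_"))
print(sample_dummies.columns.tolist())
```

The loop is more instructive about what happens row by row; the string method is shorter and avoids Python-level iteration.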

In [None]:
np.random.seed(12345)

In [None]:
values = np.random.rand(10)
values

In [None]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
bins

In [None]:
# combine get_dummies with pd.cut: one indicator column per bin
pd.get_dummies(pd.cut(values, bins))
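To make the binning step easier to follow, here is a small sketch with fixed, hand-picked values instead of random ones (`samples` and `edges` are names chosen for this example). `pd.cut` assigns each value to an interval such as `(0.2, 0.4]`, and `get_dummies` then produces one indicator column per interval.

```python
import numpy as np
import pandas as pd

# Hand-picked values so the bin membership is obvious (illustrative data).
samples = np.array([0.05, 0.25, 0.45, 0.65, 0.85, 0.95])
edges = [0, 0.2, 0.4, 0.6, 0.8, 1]

# pd.cut assigns each value to a right-closed interval such as (0.2, 0.4].
binned = pd.cut(samples, edges)

# One indicator column per interval; dtype=int gives 0/1 values.
bin_dummies = pd.get_dummies(binned, dtype=int)
print(bin_dummies.shape)
print(bin_dummies.sum(axis=0).tolist())
```

Every row contains exactly one 1, and the column sums give the number of values that fell into each bin.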