In [None]:
print("hello world")

hello world


In [2]:
import pandas as pd
import numpy as np

### Categories, Vectorization in Python, and Statistics and Probability Concepts for Data Science

### Categorical data

Categoricals are a pandas data type corresponding to categorical variables in statistics. 

A categorical variable takes on a limited, and usually fixed, number of possible values.

Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

Categorical data may have an order, but numerical operations such as addition or division cannot be performed on it.


Object creation

Series creation

Categorical Series or columns in a DataFrame can be created in several ways:

By specifying dtype="category" when constructing a Series:

In [3]:

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

# By converting an existing Series or column to a category dtype:


In [4]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")
df

   A  B
0  a  a
1  b  b
2  c  c
3  a  a


# pd.Categorical

Using the standard pandas Categorical constructor, we can create a category object.

pandas.Categorical(values, categories, ordered)

In [5]:
# pandas.Categorical(values, categories, ordered)
import pandas as pd
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)

['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']


In [7]:
import pandas as pd
# The second argument lists the allowed categories; 'd' is not one of them
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], categories=['c', 'b', 'a'])
print(cat)

['a', 'b', 'c', 'a', 'b', 'c', NaN]
Categories (3, object): ['c', 'b', 'a']



Here, the second argument specifies the categories. Any value that is not present in the categories, such as 'd', is treated as NaN.

In [8]:
import pandas as pd
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
print(cat)

['a', 'b', 'c', 'a', 'b', 'c', NaN]
Categories (3, object): ['c' < 'b' < 'a']



Logically, the order means that a is greater than b, and b is greater than c.
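A short sketch of what the ordering enables, assuming the same category order as above (the names ordered_cat and s are only illustrative): an ordered categorical supports min(), max() and comparisons, and these follow the declared category order rather than alphabetical order.

```python
import pandas as pd

# Categories are declared from lowest to highest: c < b < a
ordered_cat = pd.Categorical(['a', 'b', 'c', 'a'],
                             categories=['c', 'b', 'a'], ordered=True)

# min() and max() follow the declared order, not the alphabet
lowest = ordered_cat.min()
highest = ordered_cat.max()
print(lowest, highest)

# Comparisons against a category also use the declared order
s = pd.Series(ordered_cat)
greater_than_b = (s > 'b').tolist()
print(greater_than_b)
```

On an unordered categorical, the same min(), max() and greater-than comparisons raise a TypeError, which is why ordered=True matters here.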

Description

The describe() method computes summary statistics for a Series or DataFrame. For numeric data it reports the count, mean, standard deviation, minimum, maximum and percentiles.

It analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. For categorical or object data it reports count, unique, top and freq instead.

In [10]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})
print(df.describe())
print(df["cat"].describe())

       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object


Working with categories


    Categorical data has a categories property and an ordered property, which list the possible values and whether the ordering matters.

    These properties are exposed as s.cat.categories and s.cat.ordered. 

    If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

In [11]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s.cat.categories

Index(['a', 'b', 'c'], dtype='object')

In [12]:
# It’s also possible to pass in the categories in a specific order:
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
s.cat.categories

Index(['c', 'b', 'a'], dtype='object')

In [13]:
s.cat.ordered


False

# Unique()

The unique() method returns the unique values of a Series object.

Uniques are returned in order of appearance.


In [14]:
import numpy as np
import pandas as pd
pd.Series([2, 4, 3, 3], name='P').unique()

array([2, 4, 3], dtype=int64)

# Renaming Categories

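Renaming categories is done with the rename_categories() method. Recent pandas versions no longer allow assigning new values to the Series.cat.categories property; doing so raises "AttributeError: property 'categories' of 'Categorical' object has no setter".


In [15]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# Direct assignment (s.cat.categories = [...]) raises AttributeError in
# recent pandas versions, so rename_categories() is used instead
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])
s


0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']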

In [16]:
s = s.cat.rename_categories([1, 2, 3])
s


0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

In [17]:
# You can also pass a dict-like object to map the renaming
s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
s


0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

# Appending new categories

Appending categories can be done by using the add_categories() method:



In [18]:
s = s.cat.add_categories([4])
s.cat.categories


Index(['x', 'y', 'z', 4], dtype='object')

# Removing categories

Removing categories can be done by using the remove_categories() method. Values whose category is removed are replaced by np.nan:


In [19]:
s = s.cat.remove_categories([4])
s

0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
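The example above removes a category that no value uses, so nothing becomes NaN. As a small sketch (the names removed, missing_mask and remaining are only illustrative), removing a category that is in use shows the replacement by NaN:

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# "x" is in use, so removing it turns the matching values into NaN
removed = s.cat.remove_categories(["x"])
print(removed)

missing_mask = removed.isna().tolist()
remaining = list(removed.cat.categories)
print(missing_mask, remaining)
```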

# Removing unused categories


In [20]:
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
s


0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [21]:
s.cat.remove_unused_categories()


0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

# Cut()

Categorical data can also be created by special functions such as cut(). The pandas cut() function separates array elements into discrete bins and returns the result as categorical data. It is mainly used to turn a continuous numeric variable into a categorical one for statistical analysis.



In [22]:
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)


   value    group
0      8    0 - 9
1     62  60 - 69
2     57  50 - 59
3     14  10 - 19
4     82  80 - 89
5     85  80 - 89
6     65  60 - 69
7     89  80 - 89
8     74  70 - 79
9     60  60 - 69
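Because the example above uses random values, its output changes on every run. A deterministic sketch with a few hand-picked values (values, edges and groups are illustrative names) makes the binning easier to check by hand:

```python
import pandas as pd

values = pd.Series([8, 62, 57, 14, 99])
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
groups = pd.cut(values, bins=edges, right=False, labels=labels)
print(groups.tolist())

# cut() returns categorical data, and the bins are ordered by default
print(groups.dtype)
print(groups.cat.ordered)
```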


# Vectorization in python

Most applications have to deal with large datasets.

To keep the code computationally efficient, we use vectorization.

The running time of an algorithm is crucial in deciding whether an application is practical or not.

Running a large computation in as close to optimal time as possible is especially important when the output is needed in real time.

To this end, Python libraries provide standard mathematical functions for fast operations on entire arrays of data without having to write loops. One such library is NumPy.


# What is Vectorization ?

Vectorization speeds up Python code by replacing explicit loops with operations on whole arrays.

Using such array functions can reduce the running time of the code considerably.

Several operations are commonly performed on vectors. The dot product of two vectors, also known as the scalar product, produces a single number.

The outer product of two vectors of length n produces a square matrix of dimension n x n.

Element-wise multiplication multiplies the elements at the same index, so the dimensions of the result match those of the inputs.


We will compare the classic loop-based implementations with the standard functions by measuring their processing time.

        outer(a, b): Compute the outer product of two vectors.

        multiply(a, b): Element-wise product of two arrays.

        dot(a, b): Dot product of two arrays.

        zeros((n, m)): Return an array of the given shape, filled with zeros.

        process_time(): Return the value (in fractional seconds) of the sum of the system and user CPU time of the current process. It does not include time elapsed during sleep.
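Before the timing comparisons, a minimal sketch on two small vectors shows what each NumPy function returns. The arrays a and b and the result names are illustrative only:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot_result = np.dot(a, b)                # 1*4 + 2*5 + 3*6, a single number
outer_result = np.outer(a, b)            # 3 x 3 matrix of all pairwise products
elementwise_result = np.multiply(a, b)   # same shape as the inputs
zeros_result = np.zeros((2, 3))          # 2 x 3 array filled with zeros

print(dot_result)
print(outer_result)
print(elementwise_result)
print(zeros_result.shape)
```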


# Dot Product:

The dot product is an algebraic operation in which two equal-length vectors are multiplied element by element and the products are summed, producing a single number. The dot product is often called the inner product.



In [23]:
import time
import numpy
import array
# 'q' is a signed 64-bit integer typecode, large enough for the sums below
a = array.array('q')
for i in range(100000):
    a.append(i)
b = array.array('q')
for i in range(100000, 200000):
    b.append(i)
# classic dot product of vectors implementation
tic = time.process_time()
dot = 0.0
for i in range(len(a)):
    dot += a[i] * b[i]
# stop the timer in the same cell, right after the loop finishes
toc = time.process_time()


In [24]:
print("dot_product = " + str(dot))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
n_dot_product = numpy.dot(a, b)
n_toc = time.process_time()
print("\nn_dot_product = "+str(n_dot_product))
print("Computation time = "+str(1000*(n_toc - n_tic ))+"ms")



dot_product = 833323333350000.0
Computation time = 109.375ms

n_dot_product = 833323333350000
Computation time = 0.0ms
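process_time() can have a coarse resolution on some platforms, which is one likely reason a fast call is reported as 0.0ms. As a hedged alternative sketch, the standard-library timeit module repeats each computation with a higher-resolution timer. The helper loop_dot and the repeat count of 3 are assumptions made for this illustration:

```python
import timeit
import numpy as np

a = np.arange(100000, dtype=np.int64)
b = np.arange(100000, 200000, dtype=np.int64)

def loop_dot(x, y):
    """Classic loop-based dot product (hypothetical helper for comparison)."""
    total = 0
    for i in range(len(x)):
        total += int(x[i]) * int(y[i])
    return total

# timeit runs each callable `number` times and returns the total seconds
loop_seconds = timeit.timeit(lambda: loop_dot(a, b), number=3)
numpy_seconds = timeit.timeit(lambda: np.dot(a, b), number=3)

print("loop : {:.3f} ms per run".format(1000 * loop_seconds / 3))
print("numpy: {:.3f} ms per run".format(1000 * numpy_seconds / 3))
```

The exact timings depend on the machine, so no expected values are given here.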


# Outer Product:

The tensor product of two coordinate vectors is called the outer product.

Consider two vectors a and b with dimensions n x 1 and m x 1. Their outer product is a rectangular matrix of dimension n x m.

If the two vectors have the same dimension, the resulting matrix is square.


In [26]:
import time
import numpy
import array
a = array.array('i')
for i in range(200):
    a.append(i)
b = array.array('i')
for i in range(200, 400):
    b.append(i)
# classic outer product of vectors implementation
tic = time.process_time()
outer_product = numpy.zeros((200, 200))
for i in range(len(a)):
    for j in range(len(b)):
        outer_product[i][j] = a[i] * b[j]
# stop the timer once, after both loops have finished
toc = time.process_time()
print("outer_product = " + str(outer_product))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
outer_product = numpy.outer(a, b)
n_toc = time.process_time()
print("outer_product = "+str(outer_product));
print("\nComputation time = "+str(1000*(n_toc - n_tic ))+"ms")



outer_product = [[    0.     0.     0. ...     0.     0.     0.]
 [  200.   201.   202. ...   397.   398.   399.]
 [  400.   402.   404. ...   794.   796.   798.]
 ...
 [39400. 39597. 39794. ... 78209. 78406. 78603.]
 [39600. 39798. 39996. ... 78606. 78804. 79002.]
 [39800. 39999. 40198. ... 79003. 79202. 79401.]]
Computation time = 218.75ms
outer_product = [[    0     0     0 ...     0     0     0]
 [  200   201   202 ...   397   398   399]
 [  400   402   404 ...   794   796   798]
 ...
 [39400 39597 39794 ... 78209 78406 78603]
 [39600 39798 39996 ... 78606 78804 79002]
 [39800 39999 40198 ... 79003 79202 79401]]

Computation time = 0.0ms


# Element wise Product:

Element-wise multiplication of two matrices is the algebraic operation in which each element of the first matrix is multiplied by the corresponding element of the second matrix.

The dimensions of the two matrices must be the same.

Consider two matrices a and b. If an element of a has index (i, j), then a(i, j) is multiplied by b(i, j).

In [27]:
import time
import numpy
import array
# 'q' (signed 64-bit) is used instead of 'i' (32-bit): the largest product,
# 49999 * 99999, does not fit in 32 bits and would overflow in numpy.multiply
a = array.array('q')
for i in range(50000):
    a.append(i)
b = array.array('q')
for i in range(50000, 100000):
    b.append(i)
# classic element wise product of vectors implementation
vector = numpy.zeros((50000))
tic = time.process_time()
for i in range(len(a)):
    vector[i] = a[i] * b[i]
toc = time.process_time()


In [28]:
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
vector = numpy.multiply(a, b)
n_toc = time.process_time()
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000 * (n_toc - n_tic)) + "ms")


Element wise Product = [0.00000000e+00 5.00010000e+04 1.00004000e+05 ... 4.99955001e+09
 4.99970000e+09 4.99985000e+09]

Computation time = 31.25ms
Element wise Product = [         0      50001     100004 ... 4999550009 4999700004 4999850001]

Computation time = 0.0ms
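When multiplying integer arrays, the element type matters: NumPy integer arrays have a fixed width, and products that exceed that width wrap around silently. The sketch below, using small illustrative arrays r32 and r64, computes 49999 * 99999 with 32-bit and 64-bit integers:

```python
import numpy as np

# The same product computed with 32-bit and 64-bit integer arrays
r32 = np.array([49999], dtype=np.int32) * np.array([99999], dtype=np.int32)
r64 = np.array([49999], dtype=np.int64) * np.array([99999], dtype=np.int64)

print(r32)  # wrapped around: the true product exceeds 2**31 - 1
print(r64)  # the exact product

# The 32-bit result equals the true product minus 2**32
wrapped = 49999 * 99999 - 2**32
print(wrapped)
```

Choosing a 64-bit type (typecode 'q' in the array module, or dtype=np.int64) avoids the problem for values of this size.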


# Statistics and probability concepts for data science

Data is information collected from different sources. It can be qualitative or quantitative in nature.

Mostly, the data collected is used to analyze and draw insights on a particular topic.

For example:

1. Cylinder size, mileage, color, etc. for the sale of a car
2. Whether the cells in the body are malignant or benign, for detecting cancer

Types of Data

        Numerical Data

            Numerical data is information expressed in numbers, i.e. a quantitative measurement of things. It is either discrete or continuous.

            For example:
            Heights and weights of people
            Stock Prices

        Discrete Data

            Discrete data can only take specific values and often counts some event. The values are often integers, but not necessarily.

            For example:
            Number of times a coin was flipped
            The number of customers who have bought different products
            This data type is mainly used for simple statistical analysis because it’s easy to summarize and compute.

        Continuous Data

            Continuous data can take any value within a range, so it has infinitely many possible values.

            For example:
            How many centimeters of rain fell on a given day

        Categorical Data

            This type of data is qualitative in nature and has no inherent mathematical significance. Each unit of observation is assigned to, or “categorized” under, one of a fixed set of values.

            For example:
            Gender
            Binary Data (Yes/No)
            Qualitative attributes of a vehicle, such as color






        Ordinal Data

            This type of data is categorical data whose categories have a meaningful order, so it combines properties of numerical and categorical data.

            For example:
            Restaurant ratings from 1-5, 1 being the lowest and 5 being the highest
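Ordinal data maps naturally onto an ordered pandas categorical. A minimal sketch, assuming five made-up ratings (the names rating_dtype, ratings, best and at_least_four are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ratings 1 to 5, declared as ordered categories (1 lowest, 5 highest)
rating_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ratings = pd.Series([3, 5, 1, 4, 5], dtype=rating_dtype)

# Order-based operations work; arithmetic such as a sum is not defined
best = ratings.max()
at_least_four = int((ratings >= 4).sum())
print(best, at_least_four)
```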





# STATISTICS:

### Mean, Median and Mode

### Mean
In mathematics and statistics, the mean is the average of the numerical observations. It is equal to the sum of the observations divided by the number of observations.



### Median

The median is the middle observation of the data when it is arranged in ascending or descending order, i.e. the point separating the higher half from the lower half of the data.

To calculate the median:

Arrange the data in ascending or descending order.
For an odd number of data points, the middle value is the median.
For an even number of data points, the average of the two middle values is the median.


### Mode

The mode of a set of data points is the most frequently occurring value.
For example:
Consider the data points 5, 2, 6, 5, 1, 1, 2, 5, 3, 8, 5, 9, 5. Here 5 is the mode because it occurs most frequently (five times).
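The three measures can be checked on the same data points with Python's standard-library statistics module. This is a small sketch, and the variable names are illustrative:

```python
import statistics

data = [5, 2, 6, 5, 1, 1, 2, 5, 3, 8, 5, 9, 5]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)
```

With 13 data points, the median is the 7th value of the sorted list.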


### Variance

Variance measures how far the data points are spread out from the mean. It is the average of the squared differences from the mean.

The steps of calculating variance using an example:
Let’s find the variance of (1,4,5,4,8)


Find the mean of the data points i.e. (1 + 4 + 5 + 4 + 8)/5 = 4.4

Find the differences from the mean i.e. (-3.4, -0.4, 0.6, -0.4, 3.6)

Find the squared differences i.e. (11.56, 0.16, 0.36, 0.16, 12.96)

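Find the average of the squared differences i.e. (11.56 + 0.16 + 0.36 + 0.16 + 12.96)/5 = 5.04

The formula for the same is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

where $\mu$ is the mean and $N$ is the number of data points.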
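The four steps of the worked example can be sketched directly in Python and cross-checked against statistics.pvariance() from the standard library (the variable names are illustrative):

```python
import statistics

data = [1, 4, 5, 4, 8]

# Step 1: mean of the data points
mean_value = sum(data) / len(data)
# Steps 2 and 3: differences from the mean, then squared
squared_diffs = [(x - mean_value) ** 2 for x in data]
# Step 4: average of the squared differences (population variance)
variance = sum(squared_diffs) / len(data)

print(round(variance, 2))
# The standard library computes the same quantity
print(round(statistics.pvariance(data), 2))
```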

# Standard Deviation

Standard deviation is a measure of dispersion in statistics. Dispersion tells you how much your data is spread out. Specifically, the standard deviation shows how much your data is spread out around the mean or average, and it is the square root of the variance. For example, are all your scores close to the average? Or are lots of scores way above (or way below) the average score?

Population Data vs. Sample Data

Population data refers to the complete data set, whereas sample data refers to a part of the population data that is used for analysis. Sampling is done to make analysis easier.
When using sample data for analysis, the formula of variance is slightly different. If there are n samples in total, we divide by n-1 instead of n:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean.
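As a sketch of the difference, the standard-library statistics module offers both versions: pvariance() divides by n and variance() divides by n-1. The data set (1, 4, 5, 4, 8) is reused here purely for illustration, and the standard deviation is taken as the square root of each variance:

```python
import math
import statistics

data = [1, 4, 5, 4, 8]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

# Standard deviation is the square root of the variance
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2))
print(round(population_std, 3), round(sample_std, 3))
```

The sample variance is always the larger of the two, because the same sum of squared differences is divided by a smaller number.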



# PROBABILITY

Probability denotes the possibility of something happening. It is a mathematical concept that quantifies how likely events are to occur. Probability values lie between 0 and 1, where 0 means the event cannot occur and 1 means it is certain. This fundamental theory of probability is also applied to probability distributions.

For example:
The probability of a fair coin showing heads when it’s flipped is 0.5.
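One way to build intuition for this value is a simulation. The sketch below assumes a fair coin modeled with Python's random module; the seed and the number of flips are arbitrary choices for illustration:

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
flips = 100000

# Count a flip as heads when the random draw falls below 0.5
heads = sum(1 for _ in range(flips) if rng.random() < 0.5)
estimate = heads / flips
print(estimate)
```

The observed frequency of heads gets closer to 0.5 as the number of flips grows.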


# Conditional Probability

Conditional probability is the probability of an event occurring given that another event has already occurred.

For example:

The students of a class took two mathematics tests. 60% of the students passed the first test, while only 40% of the students passed both tests. What percentage of the students who passed the first test also passed the second test?
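The question can be answered with the standard definition of conditional probability, P(B | A) = P(A and B) / P(A), where A is passing the first test and B is passing the second. A minimal sketch using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

p_first = Fraction(60, 100)  # P(A): passed the first test
p_both = Fraction(40, 100)   # P(A and B): passed both tests

# Conditional probability: P(B | A) = P(A and B) / P(A)
p_second_given_first = p_both / p_first

print(p_second_given_first)
print("{:.1f}%".format(float(p_second_given_first) * 100))
```

So about two thirds of the students who passed the first test also passed the second.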
