# Statistics with Python

### Key Terms for Data Types
#### Numeric
   Data that are expressed on a numeric scale.
   ##### Continuous
   Data that can take on any value in an interval. (Synonyms: interval, float, numeric)
   ##### Discrete
   Data that can take on only integer values, such as counts. (Synonyms: integer, count)
   
#### Categorical
   Data that can take on only a specific set of values representing a set of possible categories. (Synonyms: enums, enumerated, factors, nominal)
   ##### Binary
   A special case of categorical data with just two categories of values, e.g., 0/1, true/false. (Synonyms: dichotomous, logical, indicator, boolean)
   ##### Ordinal
   Categorical data that has an explicit ordering. (Synonym: ordered factor)
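
As a rough sketch of how these data types can map onto pandas dtypes, the toy table below (all column names and values are hypothetical, invented only for illustration) holds one column of each kind. The ordinal column is built with `pd.Categorical(..., ordered=True)` so that comparisons respect the declared order.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],                  # continuous
    "num_children": [0, 2, 1],                           # discrete
    "blood_type": ["A", "O", "AB"],                      # categorical (nominal)
    "is_smoker": [False, True, False],                   # binary
    "education": ["high school", "bachelor", "master"],  # ordinal
})

# Nominal categories: a fixed set of labels with no ordering
df["blood_type"] = df["blood_type"].astype("category")

# Ordinal categories: the order of `categories` defines the ranking
df["education"] = pd.Categorical(
    df["education"],
    categories=["high school", "bachelor", "master"],
    ordered=True,
)

print(df.dtypes)

# Ordered categoricals support comparisons and min/max
above_high_school = (df["education"] > "high school").sum()
lowest_level = df["education"].min()
```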

### Key Terms for Rectangular Data
#### Data frame
Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.
#### Feature
A column within a table is commonly referred to as a feature. (Synonyms: attribute, input, predictor, variable)
#### Outcome
Many data science projects involve predicting an outcome, often a yes/no outcome (in Table 1-1, it is “auction was competitive or not”). The features are sometimes used to predict the outcome in an experiment or a study. (Synonyms: dependent variable, response, target, output)
#### Records
A row within a table is commonly referred to as a record.
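
A minimal sketch of these terms in pandas, assuming a small made-up auction table (the column names and values are hypothetical and are not taken from Table 1-1): each column is a feature, the `competitive` column plays the role of the outcome, and each row is a record.

```python
import pandas as pd

# Hypothetical auction records, invented for illustration only
auctions = pd.DataFrame({
    "category": ["Music", "Books", "Toys"],
    "duration_days": [5, 7, 3],
    "open_price": [0.01, 9.99, 1.00],
    "competitive": [True, False, True],  # the outcome to predict
})

# Features: every column except the outcome
features = auctions.drop(columns="competitive")

# Outcome: the single column we would try to predict
outcome = auctions["competitive"]

# Record: one row of the table
first_record = auctions.iloc[0]

print(auctions.shape)  # (number of records, number of columns)
```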

### Nonrectangular Data Structures
There are other data structures besides rectangular data.
#### Time series data
Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices (the Internet of Things).
#### Spatial data structures
Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures. In the object representation, the focus of the data is an object (e.g., a house) and its spatial coordinates. The field view, by contrast, focuses on small units of space and the value of a relevant metric (pixel brightness, for example).
#### Graph (or network) data structures
Graph (or network) data structures are used to represent physical, social, and abstract relationships. For example, a graph of a social network, such as Facebook or LinkedIn, may represent connections between people on the network. Distribution hubs connected by roads are an example of a physical network. Graph structures are useful for certain types of problems, such as network optimization and recommender systems.
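
To make two of these structures concrete, here is a small sketch with invented data: a time series stored as a pandas `Series` with a date index, and a tiny social graph stored as an adjacency mapping. The names and readings are hypothetical, and a plain dictionary is only one of several possible graph representations.

```python
import pandas as pd

# Time series: successive measurements of one variable, indexed by time.
# The readings are hypothetical.
readings = pd.Series(
    [21.5, 22.1, 21.8, 23.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
    name="temperature_c",
)
daily_change = readings.diff()  # change from the previous day

# Graph: each person maps to the set of people they are connected to.
# The people and connections are hypothetical.
network = {
    "Ana": {"Ben", "Cleo"},
    "Ben": {"Ana"},
    "Cleo": {"Ana", "Dan"},
    "Dan": {"Cleo"},
}
# Degree of a node = number of direct connections
degree = {person: len(friends) for person, friends in network.items()}
```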

### The basic data structure in data science is a rectangular matrix in which rows are records and columns are variables (features).

### Key Terms for Estimates of Location
#### Mean
The sum of all values divided by the number of values. (Synonym: average)
#### Weighted mean
The sum of all values times a weight divided by the sum of the weights. (Synonym: weighted average)
#### Median
The value such that one-half of the data lies above and below. (Synonym: 50th percentile)
#### Percentile
The value such that P percent of the data lies below. (Synonym: quantile)
#### Weighted median
The value such that one-half of the sum of the weights lies above and below the sorted data.
#### Trimmed mean
The average of all values after dropping a fixed number of extreme values. (Synonym: truncated mean)
#### Robust
Not sensitive to extreme values. (Synonym: resistant)
#### Outlier
A data value that is very different from most of the data. (Synonym: extreme value)
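
The definitions above can be sketched with NumPy as follows. The helper names `trimmed_mean` and `weighted_median` are hypothetical (they are not NumPy functions), the sample data is invented, and the weighted median uses one simple convention: the first sorted value at which the cumulative weight reaches half of the total. Other conventions exist for ties.

```python
import numpy as np

def trimmed_mean(values, n_trim):
    """Mean after dropping the n_trim smallest and n_trim largest values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    if 2 * n_trim >= ordered.size:
        raise ValueError("n_trim removes every value")
    return ordered[n_trim:ordered.size - n_trim].mean()

def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return values[idx]

# Invented sample with one outlier (100)
data = [2, 3, 3, 4, 100]

mean = np.mean(data)                                   # pulled up by the outlier
median = np.median(data)                               # robust to the outlier
trimmed = trimmed_mean(data, n_trim=1)                 # drops 2 and 100
weighted = np.average(data, weights=[1, 1, 1, 1, 0])   # weight 0 ignores 100
w_median = weighted_median(data, [1, 1, 1, 1, 1])      # equal weights
p90 = np.percentile(data, 90)                          # 90th percentile
```

With equal weights the weighted median coincides with the ordinary median here, and the mean is the only estimate in this sketch that moves far from the bulk of the data.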

In [1]:
import pandas as pd

In [8]:
dict_population = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware'],
    'Population': [4779736, 710231, 6392017, 291591, 37253956, 5029196, 3574097, 897934],
    'MurderRate': [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8],
    'Abbreviation': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE'],
}
df_population = pd.DataFrame(dict_population)
df_population

Unnamed: 0,State,Population,MurderRate,Abbreviation
0,Alabama,4779736,5.7,AL
1,Alaska,710231,5.6,AK
2,Arizona,6392017,4.7,AZ
3,Arkansas,291591,5.6,AR
4,California,37253956,4.4,CA
5,Colorado,5029196,2.8,CO
6,Connecticut,3574097,2.4,CT
7,Delaware,897934,5.8,DE


In [12]:
print('the mean of population is: ', df_population['Population'].mean())
print('the median of population is: ', df_population['Population'].median())

the mean of population is:  7366094.75
the median of population is:  4176916.5
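
The gap between the mean and the median above comes from California, whose population is far larger than that of the other seven states. The same table can also illustrate a weighted mean. The sketch below rebuilds the data so that it runs on its own (the column name `MurderRate` is this sketch's own choice) and weights each state's murder rate by its population using `np.average`.

```python
import numpy as np
import pandas as pd

# Same values as the state table above, rebuilt so the sketch is self-contained
states = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas",
              "California", "Colorado", "Connecticut", "Delaware"],
    "Population": [4779736, 710231, 6392017, 291591,
                   37253956, 5029196, 3574097, 897934],
    "MurderRate": [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8],
})

# Unweighted: every state counts equally
simple_mean = states["MurderRate"].mean()

# Weighted: each state's rate counts in proportion to its population
weighted_mean = np.average(states["MurderRate"], weights=states["Population"])

print("simple mean of murder rate:  ", simple_mean)
print("weighted mean of murder rate:", weighted_mean)
```

For these eight states the weighted mean comes out lower than the simple mean, because California carries most of the weight and its rate is below the unweighted average.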


### Note: I'm going to use a database to illustrate some concepts of statistics :)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import mysql.connector

# Start as None so the finally block is safe even if connect() fails
connection = None
cursor = None
try:
    # NOTE: credentials are hardcoded for this local demo;
    # in shared code, read them from the environment instead
    connection = mysql.connector.connect(host='localhost',
                                         database='Universities',
                                         user='root',
                                         password='bebo3000')

    # Join each yearly ranking row with the name of its university
    sql_select_Query = "select university_ranking_year.university_id, university_ranking_year.score, university_ranking_year.year, university.university_name from university_ranking_year inner join university on university_ranking_year.university_id = university.id "
    cursor = connection.cursor()
    cursor.execute(sql_select_Query)
    # get all records
    university = cursor.fetchall()
    # Columns keep pandas' default integer labels:
    # 0 = university_id, 1 = score, 2 = year, 3 = university_name
    df_sql_data = pd.DataFrame(university)
except mysql.connector.Error as e:
    print("Error reading data from MySQL table", e)
finally:
    # Close the cursor before the connection that owns it
    if cursor is not None:
        cursor.close()
    if connection is not None and connection.is_connected():
        connection.close()
        print("MySQL connection is closed")

MySQL connection is closed
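
The DataFrame built above has integer column labels (0 to 3) because `fetchall()` returns plain tuples. One way to get named columns is to read them from the cursor's `description` attribute, which DB-API cursors expose. The sketch below shows the idea with an in-memory SQLite database standing in for the MySQL server, so it runs without any setup. The table layout mirrors the query above, but the rows and university names are invented.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE university (id INTEGER PRIMARY KEY, university_name TEXT)")
conn.execute(
    "CREATE TABLE university_ranking_year (university_id INTEGER, score REAL, year INTEGER)"
)
# Invented rows, for illustration only
conn.executemany(
    "INSERT INTO university VALUES (?, ?)",
    [(1, "Example University A"), (2, "Example University B")],
)
conn.executemany(
    "INSERT INTO university_ranking_year VALUES (?, ?, ?)",
    [(1, 95.0, 2011), (2, 80.0, 2011), (1, 96.0, 2012)],
)

query = """
SELECT r.university_id, r.score, r.year, u.university_name
FROM university_ranking_year AS r
INNER JOIN university AS u ON r.university_id = u.id
"""
cursor = conn.execute(query)

# Each entry of cursor.description describes one result column; item 0 is its name
columns = [col[0] for col in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)
conn.close()

# Filtering by name is easier to read than filtering by position
ranking_2011 = df[df["year"] == 2011]
print(ranking_2011)
```

With named columns, a filter such as `df_sql_data[2] == 2011` can be written as `df["year"] == 2011`.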


In [4]:
df_sql_data[df_sql_data[2]==2011]

Unnamed: 0,0,1,2,3
0,1,100.0,2011,Harvard University
1,5,98.0,2011,California Institute of Technology
2,2,98.0,2011,Massachusetts Institute of Technology
3,3,98.0,2011,Stanford University
4,6,91.0,2011,Princeton University
...,...,...,...,...
11722,74,25.0,2011,Nagoya University
11723,119,25.0,2011,University of Bonn
11724,94,24.0,2011,University of Sydney
11725,78,24.0,2011,Case Western Reserve University


In [5]:
df_ranking2011 = df_sql_data[df_sql_data[2]==2011]

In [8]:
binnedRanking = pd.cut(df_ranking2011[1], 10)
binnedRanking

0        (90.0, 100.0]
1        (90.0, 100.0]
2        (90.0, 100.0]
3        (90.0, 100.0]
4        (90.0, 100.0]
             ...      
11722     (20.0, 30.0]
11723     (20.0, 30.0]
11724     (20.0, 30.0]
11725     (20.0, 30.0]
11726     (20.0, 30.0]
Name: 1, Length: 1607, dtype: category
Categories (10, interval[float64]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 30.0] < (30.0, 40.0] ... (60.0, 70.0] < (70.0, 80.0] < (80.0, 90.0] < (90.0, 100.0]]
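
`pd.cut` with an integer argument splits the observed range into that many equal-width bins and labels each value with its interval. A natural next step is a frequency table that counts how many values fall into each bin. The sketch below does this on a short list of invented scores (not the real ranking data), using `value_counts()` followed by `sort_index()` so the bins appear in order.

```python
import pandas as pd

# Invented scores, for illustration only
scores = pd.Series([100.0, 98.0, 91.0, 55.0, 47.0, 25.0, 24.0, 0.0])

# Ten equal-width bins over the observed range (0 to 100 here)
binned = pd.cut(scores, 10)

# Count the values per bin and list the bins from lowest to highest;
# bins with no values are kept with a count of 0
freq = binned.value_counts().sort_index()
print(freq)
```

Because the result of `pd.cut` is categorical, empty bins still show up in the table, which makes gaps in the distribution easy to spot.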