In [1]:
import datetime
import numpy as np
import pandas as pd

import plotly.graph_objects as go
from ipywidgets import widgets

ModuleNotFoundError: No module named 'plotly'

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/yankev/testing/master/datasets/nycflights.csv')
df = df.drop(df.columns[[0]], axis=1)

In [2]:
!pip install nycflights13


Collecting nycflights13
  Downloading nycflights13-0.0.3.tar.gz (8.7 MB)
Building wheels for collected packages: nycflights13
  Building wheel for nycflights13 (setup.py) ... done
  Created wheel for nycflights13: filename=nycflights13-0.0.3-py3-none-any.whl size=8732741 sha256=32c0768dd7c992bd4953b9879b10a7fe3a9cc9ca0f2515488b321608d041aa4e
  Stored in directory: /home/ilias/.cache/pip/wheels/b6/27/3d/46507f17840b411b66f10620728643140d0e6fa037df0fc9d8
Successfully built nycflights13
Installing collected packages: nycflights13
Successfully installed nycflights13-0.0.3


In [3]:
from nycflights13 import flights

In [4]:
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [5]:
flights.columns[[0]]

Index(['year'], dtype='object')

In [6]:
flights['carrier'].unique()

array(['UA', 'AA', 'B6', 'DL', 'EV', 'MQ', 'US', 'WN', 'VX', 'FL', 'AS',
       '9E', 'F9', 'HA', 'YV', 'OO'], dtype=object)

In [2]:
!pip install plotly

Collecting plotly
  Downloading plotly-4.11.0-py2.py3-none-any.whl (13.1 MB)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11430 sha256=811606b4f7376429e0d1f3c8ff5f8ba738d648ce78e2242ef3f1c321b1a61bd8
  Stored in directory: /home/ilias/.cache/pip/wheels/c4/a7/48/0a434133f6d56e878ca511c0e6c38326907c0792f67b476e56
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.11.0 retrying-1.3.3


## Exploratory data analysis

Exploratory data analysis, or EDA, is a comparatively new area of statistics. Classical
statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples. In
1962, John W. Tukey called for a reformation of statistics in his seminal paper “The
Future of Data Analysis”. He proposed a new scientific discipline called “data
analysis” that included statistical inference as just one component. Tukey forged links
to the engineering and computer science communities (he coined the terms “bit,”
short for binary digit, and “software”), and his original tenets are surprisingly durable
and form part of the foundation for data science. The field of exploratory data analysis
was established with Tukey’s now-classic 1977 book “Exploratory Data Analysis”.


In [1]:
from IPython.display import Image

# Render the portrait inline from its URL.
Image(url="https://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg")


### There are two basic types of structured data:
- <font color = "red">numeric</font> 
- <font color = "red">categorical</font> <br>

Numeric data comes in two forms: <mark style="background-color: lightblue">continuous</mark>, such as wind speed or time duration, and <mark style="background-color: lightblue">discrete</mark>,
such as the count of the occurrences of an event.

Categorical data takes only a fixed set
of values, such as a type of TV screen (plasma, LCD, LED, …) or a state name (Alabama,
Alaska, …). <mark style="background-color: lightblue">Binary</mark> data is an important special case of categorical data that
takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type
of categorical data is <mark style="background-color: lightblue">ordinal</mark> data, in which the categories are ordered; an example of
this is a numerical rating (1, 2, 3, 4, or 5).

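#### Why bother with data types?
The data type of a variable signals to the software how to process the
data. For example, a categorical column can be summarized by counting the values in each category, whereas a numeric column can be averaged.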
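
As a minimal sketch of how this looks in pandas, the cell below declares explicit types on a small toy table. The column names and values are hypothetical and invented for illustration only; they do not come from the flights data.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "wind_speed": [3.2, 5.1, 4.8],       # continuous numeric
    "n_events": [0, 2, 1],               # discrete numeric (counts)
    "screen": ["LCD", "LED", "plasma"],  # categorical
    "rating": [5, 3, 4],                 # ordinal, initially stored as integers
})

# Mark the screen type as categorical (an unordered, fixed set of values).
df["screen"] = df["screen"].astype("category")

# Mark the rating as an ordered categorical so that comparisons respect the order.
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["rating"] = df["rating"].astype(rating_type)

print(df.dtypes)
print(df["rating"].min(), df["rating"].max())
```

Once the types are declared, pandas treats `screen` as a set of labels rather than free text, and `rating` supports ordering operations such as `min` and `max` without pretending that the gaps between ratings are numerically meaningful.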

### Mean
The most basic estimate of location is the **mean**, or average value. The mean is the
sum of all the values divided by the number of values. Consider the following set of
numbers: $\{3, 5, 1, 2\}$. The mean is $(3+5+1+2)/4 = 11/4 = 2.75$. You will encounter the
symbol $\bar{x}$ to represent the mean of a sample from a population.
The formula to compute the mean for a set of $N$ values $x_1, x_2, \ldots, x_N$ is
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
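
The worked example above can be checked in a couple of lines. This is a small sketch using only the Python standard library; the variable names are arbitrary.

```python
from statistics import mean

values = [3, 5, 1, 2]

# Direct translation of the formula: sum of the values divided by their count.
manual_mean = sum(values) / len(values)

# The standard library function gives the same result.
library_mean = mean(values)

print(manual_mean)   # 2.75
print(library_mean)  # 2.75
```

For a pandas column, the equivalent call is the `mean()` method of the Series.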

A variation of the mean is a **trimmed mean**, calculated by dropping a fixed number of
sorted values at each end and then taking an average of the remaining values. Representing the sorted values by $x_1, x_2, \ldots, x_N$, where $x_1$ is the smallest value and $x_N$ the largest, the formula to compute the trimmed mean with the $p$ smallest and $p$ largest values omitted is
$$\text{Trimmed mean} = \bar{x} = \frac{\sum_{i=p+1}^{N-p} x_i}{N-2p}$$
A trimmed mean eliminates the influence of extreme values. For example, scoring for
international diving is obtained by dropping the top and bottom scores from five judges
and taking the average of the three remaining scores. This makes it difficult for a
single judge to manipulate the score, perhaps to favor their country’s contestant.
Trimmed means are widely used, and in many cases are preferable to
the ordinary mean: see “Median and Robust Estimates” for further discussion.

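
A minimal sketch of the trimmed mean, written in plain Python so that the dropping of the $p$ smallest and $p$ largest values is explicit. The function name `trimmed_mean` and the judges' scores are hypothetical, chosen only to mirror the diving example.

```python
def trimmed_mean(values, p):
    """Mean of the values after dropping the p smallest and p largest."""
    n = len(values)
    if p < 0 or 2 * p >= n:
        raise ValueError("p must satisfy 0 <= p < n / 2")
    ordered = sorted(values)
    kept = ordered[p:n - p]  # the middle n - 2p values
    return sum(kept) / len(kept)


# Hypothetical scores from five judges, invented for illustration.
scores = [7.5, 8.0, 8.0, 8.5, 9.5]

# Drop the single lowest and single highest score, then average the rest.
print(trimmed_mean(scores, 1))

# With p = 0 nothing is dropped and the result is the ordinary mean.
print(trimmed_mean([3, 5, 1, 2], 0))
```

If SciPy is available, `scipy.stats.trim_mean` offers a similar calculation, but note that it takes the proportion to cut from each end rather than a count.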
Another type of mean is a **weighted mean**, calculated by multiplying each data value $x_i$
by a weight $w_i$ and dividing their sum by the sum of the weights. The formula for a
weighted mean is
$$\text{Weighted mean} = \bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$

There are two main motivations for using a weighted mean:

1. Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight. For example, if we are taking the average
from multiple sensors and one of the sensors is less accurate, then we might
downweight the data from that sensor.
2. The data collected does not equally represent the different groups that we are
interested in measuring. For example, because of the way an online experiment was
conducted, we may not have a set of data that accurately reflects all groups in the
user base. To correct that, we can give a higher weight to the values from the
groups that were underrepresented.
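
The sensor scenario from the first motivation can be sketched as follows. The function name `weighted_mean`, the readings, and the weights are all hypothetical and serve only to illustrate the formula.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical readings from three sensors, invented for illustration.
readings = [20.0, 21.0, 25.0]

# Assumption: the third sensor is less accurate, so it gets half the weight.
weights = [1.0, 1.0, 0.5]

print(weighted_mean(readings, weights))          # downweights the third sensor
print(weighted_mean(readings, [1.0, 1.0, 1.0]))  # equal weights: ordinary mean
```

With equal weights the result reduces to the ordinary mean, which is a quick sanity check on any weighting scheme. NumPy users can get the same calculation from `np.average(values, weights=weights)`.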

In [2]:
from pydataset import data

ModuleNotFoundError: No module named 'pydataset'

In [3]:
!conda install --name data_analytics pydataset


Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - pydataset

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.


