In [1]:
import numpy as np 
import pandas as pd

## What is Data?
Data is a collection of raw facts, figures, and statistics in any form, such as numbers, text, sound, or images. It is the raw material from which information and knowledge are derived. Data can be measured, collected, reported, and analyzed, and it is often visualized with graphs, charts, or other analysis tools.

## What is Information?
Information is data that has been processed, organized, or structured so that it becomes meaningful and useful. It is data that has been given context, relevance, and purpose. It provides the understanding and insight needed for decision-making, problem-solving, and communication.

## Categories of Data
Data can be categorized into two main types:

* Structured Data: Data organized into a specific format, which makes it easy to search, analyze, and process. Structured data is typically found in relational databases and includes fields such as numbers, dates, and categories.
* Unstructured Data: Data that does not conform to a specific structure or format. It includes text documents, images, videos, and other data that cannot be easily organized or analyzed without additional processing.
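The difference between the two categories can be sketched with a small example. The student records and the review text below are made up for illustration; the point is only that tabular data can be queried directly, while free text needs a processing step first.

```python
import pandas as pd

# Structured: every record has the same named fields, so it fits a table.
# The names and values are hypothetical.
students = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [20, 22, 21],
        "enrolled": ["2023-09-01", "2022-09-01", "2023-01-15"],
    }
)
students["enrolled"] = pd.to_datetime(students["enrolled"])

# Because the layout is fixed, filtering is a one-liner.
over_20 = students[students["age"] > 20]

# Unstructured: free text has no fields to query directly.
review = "Loved the course, but the second week felt rushed."

# Extra processing (here, a naive whitespace split) is needed before analysis.
word_count = len(review.split())
```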

## Types of Data
### 1. Quantitative Data (Numerical Data)
Quantitative data represents numerical values that can be measured or counted. It is further classified into three categories:

#### a. Discrete Data
Discrete data consists of distinct, separate values that can be counted as whole numbers. Examples include the number of students in a class, marks of students in a test, and the number of cars in a parking lot. Discrete data is often visualized using bar charts.

#### b. Continuous Data
Continuous data represents measurements that can take any value within a given range. Examples include temperature, height, weight, and salary. Continuous data is often visualized using histograms and line charts.
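As a rough illustration, discrete and continuous data usually end up with different pandas dtypes and are summarized differently. The class sizes, heights, and bin edges below are assumptions chosen for the example.

```python
import pandas as pd

# Discrete: counts stored as integers (hypothetical class sizes).
class_sizes = pd.Series([28, 31, 25, 31, 30])

# Continuous: measurements stored as floats (hypothetical heights in cm).
heights_cm = pd.Series([158.2, 171.5, 165.0, 180.3, 162.7])

# value_counts() tallies each distinct value: the numbers a bar chart would plot.
size_counts = class_sizes.value_counts()

# pd.cut() groups continuous values into intervals: the bins a histogram would plot.
height_bins = pd.cut(heights_cm, bins=[150, 160, 170, 180, 190])
bin_counts = height_bins.value_counts().sort_index()
```

Counting distinct values only makes sense for discrete data; continuous values rarely repeat exactly, so they are binned first.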

#### c. Time-Series Data
Time-series data is collected or recorded at successive points in time, usually at equally spaced intervals. It shows how a particular variable changes over time. Examples include daily stock prices, weather data, and monthly sales figures. Time-series data is usually visualized using line charts.
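In pandas, a time series is typically a Series whose index holds timestamps. The sketch below uses an assumed start date and invented daily sales figures.

```python
import pandas as pd

# Seven consecutive days; the start date and sales figures are made up.
dates = pd.date_range(start="2024-01-01", periods=7, freq="D")
daily_sales = pd.Series([120, 135, 128, 150, 162, 90, 85], index=dates)

# A DatetimeIndex allows selection by date string (both endpoints included).
first_three_days = daily_sales.loc["2024-01-01":"2024-01-03"]

# It also supports time-aware operations such as a 3-day rolling mean.
# The first two entries are NaN because the window is not yet full.
rolling_mean = daily_sales.rolling(window=3).mean()
```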

### 2. Qualitative Data (Categorical Data)
Qualitative data, also known as categorical data, describes qualities, characteristics, or opinions. It is non-numerical and is used to categorize observations into groups. Qualitative data is further divided into two categories:

#### a. Nominal Data
Nominal data consists of categories or names that cannot be ordered or ranked. Examples include gender (male, female), race (White, Black, Asian), and blood type (A, B, AB, O). Nominal data is commonly analyzed using tests such as the chi-squared test and Fisher's exact test.

#### b. Ordinal Data
Ordinal data consists of categories that can be ordered or ranked, but the distance between categories is not necessarily equal. Examples include education level (Elementary, Middle, High School, College) and job position (Manager, Supervisor, Employee). Ordinal data is commonly analyzed using non-parametric tests such as the Wilcoxon signed-rank test and the Mann-Whitney U test.
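pandas can represent both kinds of qualitative data with its categorical dtype; the only difference is whether the categories are declared as ordered. The sample values below are illustrative.

```python
import pandas as pd

# Nominal: categories have no order (blood types).
blood_type = pd.Series(["A", "O", "B", "O", "AB"], dtype="category")

# Ordinal: the same kind of object, but with an explicit ranking.
levels = ["Elementary", "Middle", "High School", "College"]
education = pd.Series(
    pd.Categorical(
        ["College", "Middle", "High School", "Elementary"],
        categories=levels,
        ordered=True,
    )
)

# Comparisons and min/max are defined only for the ordered categorical.
beyond_middle = education > "Middle"
highest = education.max()
```

Note that the ordering says nothing about spacing: pandas knows that College ranks above Middle, but not by how much.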

## Core components of pandas: Series and DataFrames
The two primary components of pandas are the Series and the DataFrame.

A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series that share the same index.
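A quick way to see this relationship is to pull one column out of a DataFrame. The column names and values here are hypothetical.

```python
import pandas as pd

# A small table with made-up values.
df = pd.DataFrame({"height_cm": [158.2, 171.5], "age": [20, 22]})

# Selecting one column returns a Series that shares the DataFrame's row index.
ages = df["age"]
print(type(ages).__name__)   # Series
print(df.shape)              # (2, 2): two rows, two columns
```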

### The Pandas Series Object
A pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:


In [2]:
series = pd.Series([0.25, 0.5, 0.75, 1.0])
series

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
series[2]

0.75

In [4]:
series[:3]

0    0.25
1    0.50
2    0.75
dtype: float64

In [5]:
series.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
series.values

array([0.25, 0.5 , 0.75, 1.  ])

In [7]:
# Defining the index explicitly gives the Series additional capabilities.
# The index need not be integers; it can hold values of any desired type, such as strings:

data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [9]:
# data['a'] would return the single value 0.25.
# Slicing by label includes the end label 'c'.
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64
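Label-based slices and position-based slices treat their endpoints differently, which is easy to trip over. A minimal sketch using the explicit `.loc` and `.iloc` accessors on the same string-indexed Series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

# Label-based slice: both endpoints are included.
by_label = data.loc["a":"c"]      # rows a, b, c

# Position-based slice: the stop position is excluded, as with Python lists.
by_position = data.iloc[0:2]      # rows a, b
```

Using `.loc` and `.iloc` makes the intent explicit instead of relying on how plain `[]` interprets the slice.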

In [11]:
# A Series behaves like a specialized dictionary that maps index labels to values

population_dict = {
    'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
    'Florida': 19552860, 'Illinois': 12882135,
}
population_series = pd.Series(population_dict)
population_series

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [12]:
population_series['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
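Beyond slicing, a Series supports dictionary-style lookups alongside array-style operations that a plain `dict` lacks. A short sketch using three of the population figures; the 20 million threshold is an arbitrary choice for the example.

```python
import pandas as pd

population = pd.Series(
    {"California": 38332521, "Texas": 26448193, "New York": 19651127}
)

# Dictionary-like: membership test and key lookup.
has_texas = "Texas" in population
texas = population["Texas"]

# Array-like: vectorized arithmetic and boolean filtering.
in_millions = population / 1_000_000
large_states = population[population > 20_000_000]
```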

### The Pandas DataFrame Object
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [None]:
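# Each Series becomes one column of the DataFrame
apple = pd.Series([3, 2, 0, 1])
orange = pd.Series([0, 3, 7, 2])
index = pd.Index(['June', 'Robert', 'Lily', 'David'])
purchase = pd.DataFrame({"Apple": apple, "Orange": orange})
# set_index returns a new DataFrame with the names as row labels; purchase itself is unchanged
purchase.set_index(index)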

        Apple  Orange
June        3       0
Robert      2       3
Lily        0       7
David       1       2
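Since `set_index` returns a new DataFrame rather than modifying `purchase` in place, the result has to be assigned to be kept. One alternative, sketched here with the same data, is to pass the row labels through the `index=` argument when the DataFrame is constructed.

```python
import pandas as pd

names = ["June", "Robert", "Lily", "David"]

# Passing index= at construction labels the rows in one step.
purchase = pd.DataFrame(
    {"Apple": [3, 2, 0, 1], "Orange": [0, 3, 7, 2]},
    index=names,
)

# Rows are selected by label with .loc, columns with [].
lily = purchase.loc["Lily"]              # Apple 0, Orange 7
total_apples = purchase["Apple"].sum()   # 3 + 2 + 0 + 1
```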
