# **Pandas DataFrame**

1. Definition of a DataFrame:
   * A two-dimensional data structure
   * Has flexible row indices and column names
   * Can be thought of as a collection of aligned Series objects

2. Relationship to Other Data Structures:
   * Generalization of a NumPy array:
      * Like a NumPy array, but with flexible row and column labels
      * Can contain multiple data types across columns (unlike NumPy arrays)
   * Specialization of a Python dictionary:
      * Like a dict of Series objects, all sharing the same index

3. Conceptual View:
    * Each column in a DataFrame is essentially a Series
    * All Series in a DataFrame share the same index

4. Key Characteristics:
   * Has both row and column labels
   * Can contain heterogeneous data types across columns
   * Size-mutable: columns can be added or removed
   * Both rows and columns are indexable

5. Advantages:
   * More intuitive data access using meaningful labels
   * Easier to work with heterogeneous data
   * Built-in alignment based on labels
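
The characteristics and advantages listed above can be illustrated with a short sketch. The names and values are made up for illustration, and only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame(
    {"Age": [28, 34, 29], "City": ["New York", "Paris", "Berlin"]},
    index=["John", "Anna", "Peter"],
)

# Each column is a Series that shares the DataFrame's index.
shares_index = df["Age"].index.equals(df.index)

# Heterogeneous data: one integer column and one string column.
dtypes = df.dtypes

# Size-mutable: add a column, then remove it again.
df["Senior"] = df["Age"] > 30
with_extra = list(df.columns)
df = df.drop(columns="Senior")

# Label alignment: values are matched by index label, not by position.
bonus = pd.Series({"Peter": 1, "John": 2, "Anna": 3})
aligned = df["Age"] + bonus

print(shares_index)
print(dtypes)
print(with_extra)
print(aligned)
```

Although `bonus` lists the labels in a different order, each value is added to the row with the matching label, so John ends up with 28 + 2 rather than 28 + 1.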

In [6]:
import pandas as pd
   
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
   
df = pd.DataFrame(data)
df

| | Name | Age | City |
|---|---|---|---|
| 0 | John | 28 | New York |
| 1 | Anna | 34 | Paris |
| 2 | Peter | 29 | Berlin |
| 3 | Linda | 32 | London |


## DataFrame as a Generalized NumPy Array

Key Differences from NumPy Arrays:

* Labeled axes (rows and columns)
* Ability to contain multiple data types across columns
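
A small comparison may make these two differences concrete. This is a minimal sketch with made-up values: the NumPy array is forced into a single dtype, while the DataFrame keeps one dtype per column and supports lookup by label.

```python
import numpy as np
import pandas as pd

# Mixed values in a NumPy array are coerced to a single dtype (here, strings).
arr = np.array([[111, "Music"], [112, "Dance"]])
print(arr.dtype)   # one Unicode string dtype for every element

# A DataFrame keeps a separate dtype per column and labels both axes.
frame = pd.DataFrame(
    {"id": [111, 112], "art": ["Music", "Dance"]},
    index=["Aardra", "Arya"],
)
print(frame.dtypes)
print(frame.loc["Arya", "art"])   # label-based lookup
```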


#### Creating a DataFrame:
**The cells below combine two Series (names and arts) into a single DataFrame. Both Series are indexed by the same labels, so their values line up row by row.**

In [17]:
name_dict = {'Aardra': 111, 'Arya': 112, 'Federy': 113,
             'Riya': 114, 'Sree': 115}
names = pd.Series(name_dict)

art_dict = {'Aardra': 'Music', 'Arya': 'Dance', 'Federy': 'Flute',
            'Riya': 'Music', 'Sree': 'Dance'}
arts = pd.Series(art_dict)

In [23]:
DF = pd.DataFrame({'names': names, 'arts': arts})
DF

| | names | arts |
|---|---|---|
| Aardra | 111 | Music |
| Arya | 112 | Dance |
| Federy | 113 | Flute |
| Riya | 114 | Music |
| Sree | 115 | Dance |


**DataFrame Components:**

In [26]:
print(DF.index)
print(DF.columns)

Index(['Aardra', 'Arya', 'Federy', 'Riya', 'Sree'], dtype='object')
Index(['names', 'arts'], dtype='object')


## DataFrame as a Specialized Dictionary

A DataFrame can be thought of as a specialized dictionary in which each column name maps to a Series of data.
* Each column in a DataFrame is essentially a Series, and the column name acts as the key used to access that Series.

**Accessing columns:**

In [34]:
DF['names']

Aardra    111
Arya      112
Federy    113
Riya      114
Sree      115
Name: names, dtype: int64
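
The dictionary analogy extends beyond bracket access. The sketch below rebuilds a two-row version of the frame so that it runs on its own, then tries a few dict-style operations that pandas supports on DataFrames.

```python
import pandas as pd

# Rebuild a small frame like the one above so the snippet is self-contained.
DF = pd.DataFrame({
    "names": {"Aardra": 111, "Arya": 112},
    "arts": {"Aardra": "Music", "Arya": "Dance"},
})

print("names" in DF)      # membership tests check column names, like dict keys
print(list(DF.keys()))    # keys() returns the column labels

# items() yields (column name, Series) pairs, like dict.items().
for col, series in DF.items():
    print(col, type(series).__name__)

# get() accepts a default for a missing column, like dict.get().
print(DF.get("missing", "no such column"))
```

Note that `in` looks at column labels only, so a row label such as `'Aardra'` is not found this way.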

1. Difference from NumPy arrays:
* In a 2D NumPy array, data[0] returns the first row.
* In a pandas DataFrame, data['col0'] returns the column labeled 'col0', not a row.

2. Conceptual model:
* While a DataFrame can be viewed as either a specialized dictionary or a generalized array, the dictionary analogy is often more intuitive, especially when working with named columns.

3. Further indexing:
* .loc[] for label-based indexing
* .iloc[] for integer-based indexing
* Boolean indexing
* Multi-level indexing
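
The row-versus-column difference described in point 1 can be checked directly. This is a minimal sketch; the array values and the column names `col0` and `col1` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(arr, columns=["col0", "col1"])

first_row = arr[0]         # NumPy: an integer index selects a row
col0 = frame["col0"]       # pandas: a key selects the column with that label
same_row = frame.iloc[0]   # pandas: positional row access goes through .iloc

print(first_row)           # [1 2]
print(col0.tolist())       # [1, 3, 5]
print(same_row.tolist())   # [1, 2]
```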

In [39]:
import pandas as pd

# Create a sample DataFrame
data = {
    'state': ['California', 'Texas', 'Florida', 'New York'],
    'area': [423970, 695662, 170312, 141297],
    'population': [39538223, 29145505, 21538187, 20201249]
}

states = pd.DataFrame(data)

In [41]:
# Access the 'area' column
print(states['area'])

0    423970
1    695662
2    170312
3    141297
Name: area, dtype: int64


In [43]:
# Access the first row
print(states.iloc[0])

state         California
area              423970
population      39538223
Name: 0, dtype: object


In [45]:
# Access the 'area' for 'California'
print(states.loc[states['state'] == 'California', 'area'])

0    423970
Name: area, dtype: int64
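
Boolean indexing and multi-level indexing, both mentioned in the list of further indexing options, are not shown in the cells above. The following sketch reuses the same `states` data. The `region` labels are a hypothetical addition made only to have a second index level.

```python
import pandas as pd

states = pd.DataFrame({
    "state": ["California", "Texas", "Florida", "New York"],
    "area": [423970, 695662, 170312, 141297],
    "population": [39538223, 29145505, 21538187, 20201249],
})

# Boolean indexing: keep only the rows where the condition is True.
large = states[states["population"] > 25_000_000]
print(large["state"].tolist())   # ['California', 'Texas']

# Label-based access is more direct once 'state' is the index.
by_state = states.set_index("state")
print(by_state.loc["California", "area"])   # 423970

# Multi-level indexing: 'region' is a hypothetical column added for illustration.
regions = ["West", "South", "South", "Northeast"]
multi = states.assign(region=regions).set_index(["region", "state"])
print(multi.loc["South"].index.tolist())    # ['Texas', 'Florida']
```

Setting `state` as the index is a design choice rather than a requirement. It lets `.loc` take the state name directly instead of a boolean mask.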


## **Constructing DataFrames:**

1. From a single Series
2. From a list of dictionaries
3. From a dictionary of Series objects
4. From a two-dimensional NumPy array
5. From a NumPy structured array

In [52]:
import pandas as pd
import numpy as np

# From a single Series
population = pd.Series([38332521, 19552860, 12882135, 19651127, 26448193],
                       index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df = pd.DataFrame(population, columns=['population'])
df

| | population |
|---|---|
| California | 38332521 |
| Florida | 19552860 |
| Illinois | 12882135 |
| New York | 19651127 |
| Texas | 26448193 |
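
As an alternative to the constructor call above, `Series.to_frame` does the same conversion and names the column in one step. This is a sketch of one possible approach, not the only way to do it.

```python
import pandas as pd

population = pd.Series(
    [38332521, 19552860, 12882135, 19651127, 26448193],
    index=["California", "Florida", "Illinois", "New York", "Texas"],
)

# to_frame() wraps the Series in a one-column DataFrame and names the column.
df = population.to_frame(name="population")
print(df.shape)           # (5, 1)
print(list(df.columns))   # ['population']
```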


In [54]:
# From a list of dicts
data = [{'a': i, 'b': 2 * i} for i in range(3)]
df = pd.DataFrame(data)
df

| | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


In [60]:
# From a list of dicts with differing keys: missing entries are filled with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

| | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
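
The output above shows `a` and `c` as floats (1.0, 4.0) even though the inputs were integers. This happens because NaN is a floating-point value, so any column that contains it is stored as float. The sketch below inspects the dtypes and shows one possible way to return to integers. Filling with 0 is an assumption that only makes sense when 0 is a meaningful default.

```python
import pandas as pd

df = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(df.dtypes)         # 'a' and 'c' are float64 because they contain NaN
print(df.isna().sum())   # number of missing values per column

# One possible follow-up: fill the gaps, then convert back to integers.
filled = df.fillna(0).astype(int)
print(filled)
```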


In [56]:
# From a dictionary of Series
area = pd.Series([423967, 170312, 149995, 141297, 695662],
                 index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
df = pd.DataFrame({'population': population, 'area': area})
df

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
| New York | 19651127 | 141297 |
| Texas | 26448193 | 695662 |


In [58]:
# From a 2D NumPy array
df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
df

| | foo | bar |
|---|---|---|
| a | 0.304137 | 0.131905 |
| b | 0.684456 | 0.695771 |
| c | 0.688643 | 0.974269 |


In [62]:
# From a NumPy structured array: field names become column names
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)

| | A | B |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |
