# DataFrame

# Some of the popular libraries

1. **NumPy:**- Numerical computing, array operations.

2. **Pandas:**- Data manipulation, tabular data analysis.

3. **Matplotlib:**- Data visualization, plotting.

4. **Seaborn:**- Statistical data visualization.

5. **SciPy:**- Scientific computing, optimization.

6. **Scikit-learn:**- Machine learning, data mining.

7. **TensorFlow:**- Deep learning, machine learning.

8. **PyTorch:**- Deep learning, machine learning.

9. **Keras:**- Neural networks API, deep learning.

10. **OpenCV:**- Computer vision, image processing.

11. **Beautiful Soup:**- Web scraping, HTML parsing.

12. **Requests:**- HTTP requests, web communication.

13. **NLTK:**- Natural language processing, text analysis.

14. **Django:**- Web framework, MTV (model-template-view) architecture.

15. **Flask:**- Lightweight web framework.

16. **SQLAlchemy:**- Database interaction, ORM.

17. **Statsmodels:**- Statistical modeling, hypothesis testing.

18. **Bokeh:**- Interactive data visualization, web.

19. **NetworkX:**- Network analysis, graph theory.


These libraries consist of multiple packages, with each package containing various modules. Each module comprises functions and classes. Classes, in turn, are composed of attributes and methods. An attribute is a value stored on a class or its instances, while a method is a function defined on the class. Note that a library can also expose classes and functions directly at its top level, so you do not always have to spell out the full package and module path.


For example, in pandas.core.frame:
1. pandas = library
2. core = package
3. frame = module

In the frame module:
1. class = DataFrame
2. attributes = index, shape, columns, etc.
3. methods = head(), tail(), describe(), etc.

Some functions are available directly at the top level of the library, such as:
1. pandas.read_csv()
2. pandas.concat(), and more

An object (instance) of the class is created with df = DataFrame().
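The hierarchy described above can be checked from Python itself. The following is a minimal sketch; the variable name df and the sample data are only illustrative.

```python
import pandas as pd
from pandas.core.frame import DataFrame  # library.package.module -> class

# the top-level name pd.DataFrame refers to the same class object
print(DataFrame is pd.DataFrame)
print(pd.DataFrame.__module__)

df = pd.DataFrame({"A": [1, 2, 3]})  # df is an object (instance) of the class
print(df.shape)    # attribute: accessed without parentheses
print(df.head(2))  # method: called with parentheses
```

In everyday code the top-level name pd.DataFrame is the one to use; the full import path is shown here only to make the structure visible.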


In [98]:
import pandas as pd
import numpy as np

In [23]:
d={
    "one":pd.Series([1,2,3,4,5]),
    "two":pd.Series(["a", "b","c","d","e"])

}

df=pd.DataFrame(d)
df

Unnamed: 0,one,two
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e


In [24]:
d={
    'Name':['Alice', 'Bob', 'Charlie'],
    'Age':[34,45,22],
    'City':['New York', 'San Francisco', 'Los Angeles']
}

df=pd.DataFrame(d)
df

Unnamed: 0,Name,Age,City
0,Alice,34,New York
1,Bob,45,San Francisco
2,Charlie,22,Los Angeles


In [25]:
d={
    "Name":("Alice", "Bob", "Charlie"),
    "Age":(34,45,22),
    "city":("New York", "San Francisco", "Los Angeles")
}

df=pd.DataFrame(d)
df



Unnamed: 0,Name,Age,city
0,Alice,34,New York
1,Bob,45,San Francisco
2,Charlie,22,Los Angeles


In [26]:
d={
    "list_1":np.array([1,2,3,4,5]),
    "list_2":np.array([1.1,2.2,3.3,4.4,5.5])
}

pd.DataFrame(d)

Unnamed: 0,list_1,list_2
0,1,1.1
1,2,2.2
2,3,3.3
3,4,4.4
4,5,5.5


In [27]:
import pandas as pd
import numpy as np

In [28]:
s1=pd.Series(np.random.randn(5),index=list('abcde'), name="col_1")
s2=pd.Series(np.random.randn(5), index=list('abcde'),name="col_2")

df=pd.DataFrame([s1, s2])
print(df)

              a         b         c         d         e
col_1 -0.666910  1.558146 -1.356631 -0.879514 -0.171357
col_2 -0.076428  0.530907  0.004267  1.623219  0.718948


In [29]:
# mixing integers and strings in one array makes NumPy store every element as a string
a=np.array([[1,2,3],["A", "B", "C"]])
df=pd.DataFrame(a, columns=["col_1","col_2", "col_3"])
df

Unnamed: 0,col_1,col_2,col_3
0,1,2,3
1,A,B,C


In [2]:
import pandas as pd

In [3]:
l1=[1,2,3]
l2=['a','b','c']

pd.DataFrame([l1,l2]).T

Unnamed: 0,0,1
0,1,a
1,2,b
2,3,c


In [31]:
ser = pd.Series(range(3), index=list("abc"), name="ser")
pd.DataFrame(ser)

Unnamed: 0,ser
a,0
b,1
c,2


In [32]:
#from alternate constructors
df=pd.DataFrame.from_dict(dict([("A",[1,2,3]), ("B", [2,3,4])]))

In [33]:
df['A']

0    1
1    2
2    3
Name: A, dtype: int64

In [34]:
df['C']=df['A']*2 # other arithmetic operators work the same way: +, -, /, //, **, %


In [35]:
df['D']='four'

In [36]:
df

Unnamed: 0,A,B,C,D
0,1,2,2,four
1,2,3,4,four
2,3,4,6,four


In [37]:
df['one_trunc']=df['A'][:2]

In [38]:
df

Unnamed: 0,A,B,C,D,one_trunc
0,1,2,2,four,1.0
1,2,3,4,four,2.0
2,3,4,6,four,


In [39]:
df.insert(1,"bar", df["A"])

In [40]:
df.insert(loc=1, column='new_col', value=[11,22,np.nan])


In [41]:
df

Unnamed: 0,A,new_col,bar,B,C,D,one_trunc
0,1,11.0,1,2,2,four,1.0
1,2,22.0,2,3,4,four,2.0
2,3,,3,4,6,four,


In [42]:
# assign() returns a new DataFrame and leaves the original DataFrame untouched
df.assign(last_col=df['C']/df['A'])

Unnamed: 0,A,new_col,bar,B,C,D,one_trunc,last_col
0,1,11.0,1,2,2,four,1.0,2.0
1,2,22.0,2,3,4,four,2.0,2.0
2,3,,3,4,6,four,,2.0


In [43]:
df

Unnamed: 0,A,new_col,bar,B,C,D,one_trunc
0,1,11.0,1,2,2,four,1.0
1,2,22.0,2,3,4,four,2.0
2,3,,3,4,6,four,
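Because assign() returns a new DataFrame, it also accepts callables, which lets one new column build on another within a single call. A small sketch, assuming illustrative column names (ratio, half) that are not part of the notebook above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [2, 4, 6]})

# each callable receives the DataFrame as built so far,
# so "half" can refer to the "ratio" column created just before it
result = df.assign(
    ratio=lambda x: x["C"] / x["A"],
    half=lambda x: x["ratio"] / 2,
)

print(result)
print(list(df.columns))  # the original still has only A and C
```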


In [44]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [45]:
np.zeros((3,4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [46]:
np.zeros((2,3,2))

array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

In [47]:
np.zeros((3,), dtype=int)

array([0, 0, 0])

In [48]:
# a 1D structured array with two elements;
# each element is a record with three fields (A, B, C)
data=np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
# note: this rebinds data to a plain Python list, so the array above is discarded
data=[(1,2.0,"Hello"),(2, 3.0, "World")]
print(data)

[(1, 2.0, 'Hello'), (2, 3.0, 'World')]


In [49]:
# from a structured or record array
# i4 = 32-bit integer, f4 = 32-bit float, S10 = byte string of at most 10 bytes
# (S10 replaces the older alias a10, which is deprecated as of NumPy 2.0)
data=np.zeros((2,), dtype=[("A", "i4"),("B","f4"),("C","S10")])

# fill the 1D array with two records
data[:]=[(1,2.0,"Hello"), (2,3.0, "World")]

df=pd.DataFrame(data)
df
# the values in column C carry the b prefix
# because the strings are stored as bytes rather than Unicode strings


Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


In [50]:
print(data)

[(1, 2., b'Hello') (2, 3., b'World')]


In [51]:
# decode the bytes in column C into Unicode strings
df['C']=df['C'].str.decode('utf-8')
df

Unnamed: 0,A,B,C
0,1,2.0,Hello
1,2,3.0,World
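An alternative to decoding afterwards is to declare the field as Unicode from the start. This is a sketch under the assumption that a fixed-width Unicode field ("U10") suits the data; it is not what the notebook above does.

```python
import numpy as np
import pandas as pd

# "U10" is a Unicode string field of at most 10 characters
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")])
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

df = pd.DataFrame(data)
print(df)  # column C holds ordinary strings, with no b prefix
```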


In [52]:
list_1=[(1,2.0, "Hello"), (2,5.0,"World")]

pd.DataFrame(list_1, columns=["col_1", "col_2", "col_3"])


Unnamed: 0,col_1,col_2,col_3
0,1,2.0,Hello
1,2,5.0,World


In [53]:
# np.array() on tuples of mixed types coerces every value to a string
data_1=np.array(list_1)
pd.DataFrame(data_1, columns=["col_1", "col_2", "col_3"])

Unnamed: 0,col_1,col_2,col_3
0,1,2.0,Hello
1,2,5.0,World


In [54]:
import numpy as np
random_array=np.random.rand(3,3)

In [55]:
random_array

array([[0.51101577, 0.3531533 , 0.21902801],
       [0.18127995, 0.74914821, 0.48376842],
       [0.24341937, 0.21619516, 0.24708313]])

In [56]:
pd.DataFrame(random_array)

Unnamed: 0,0,1,2
0,0.511016,0.353153,0.219028
1,0.18128,0.749148,0.483768
2,0.243419,0.216195,0.247083


In [57]:
data2=[{"A":1, "B":2}, {"A":5, "B":10, "C":20}]

pd.DataFrame(data2, index=["index_1", "index_2"])

Unnamed: 0,A,B,C
index_1,1,2,
index_2,5,10,20.0


In [58]:
from collections import namedtuple

two_point=namedtuple("twopoint", "x y")
three_point=namedtuple("threepoint", "x y z")

pd.DataFrame([two_point(0,0), two_point(0,3), two_point(2,3)])


Unnamed: 0,x,y
0,0,0
1,0,3
2,2,3


In [59]:
pd.DataFrame([three_point(9,5,3), three_point(4,3,2), two_point(3,3)])

Unnamed: 0,x,y,z
0,9,5,3.0
1,4,3,2.0
2,3,3,


In [60]:
# from a list of dataclass instances
from dataclasses import make_dataclass

# the class must be bound to the name Point, since Point(...) is called below
Point=make_dataclass("Point", [("x", int), ("y", int)])

pd.DataFrame([Point(0,0), Point(0, 3), Point(2,3)])

Unnamed: 0,x,y
0,0,0
1,0,3
2,2,3


##### namedtuple and make_dataclass are similar: both create simple classes. The difference is that namedtuple instances are immutable, whereas instances of a class built with make_dataclass are mutable by default.

In [61]:
# namedtuple instances are immutable:
# their values cannot be changed once set

from collections import namedtuple
Point=namedtuple("Point", "x y")
p=Point(1,2)
print(p)
p.x=3 # raises AttributeError because the value cannot be updated

Point(x=1, y=2)


AttributeError: can't set attribute

In [62]:
# a class built with make_dataclass is mutable by default:
# its field values can be updated after creation

from dataclasses import make_dataclass

Point=make_dataclass("Point", [("x", int), ("y", int)])

p=Point(1,2)

p.x=3
p.y=4

print(p)

Point(x=3, y=4)


#### Alternate constructors

In [63]:
pd.DataFrame.from_dict(
    dict(
        [
            ("A", [1,2,3]),
            ("B", [4,5,6])
        ]
    )
)

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [64]:
df=pd.DataFrame.from_dict(
    dict(
        [
            ("A", [1,2,3]),
            ("B", [4,5,6])
        ]
    ),
    orient="index",
    columns=["one", "two", "three"],
)
df

Unnamed: 0,one,two,three
A,1,2,3
B,4,5,6


In [65]:
df.assign(four=df['one']+df['two'], five=df['one']*2)

Unnamed: 0,one,two,three,four,five
A,1,2,3,3,2
B,4,5,6,9,8


#### DataFrame and NumPy array

##### Homogeneous data

In [74]:
df=pd.DataFrame(
    {
        "A":[1,2,3],
        "B":[4,5,6]
    }
)
df_a=df.to_numpy()
print(df_a)
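df_a[0,0]=99 # the array can be modified in place
print("======")
print(df_a)
print("=====")
print(df) # here the change appears in df as well
# this happens because, when all columns share one dtype, to_numpy() may return
# a view of the DataFrame's data instead of a copy; do not rely on it: with
# Copy-on-Write (the default in pandas 3.0) the returned array is read-only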

[[1 4]
 [2 5]
 [3 6]]
======
[[99  4]
 [ 2  5]
 [ 3  6]]
=====
    A  B
0  99  4
1   2  5
2   3  6
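When the array is meant to be modified independently of the DataFrame, asking for a copy explicitly avoids depending on whether to_numpy() returns a view. A minimal sketch using the copy parameter of to_numpy(); the variable name safe is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# copy=True guarantees that the array does not share memory with df
safe = df.to_numpy(copy=True)
safe[0, 0] = 99

print(safe)
print(df)  # df is unchanged
```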


##### Heterogeneous data

In [79]:
df=pd.DataFrame(
    {
        "A":[1,2,3],
        "B":['a','b','c']
    }
)
df_a=df.to_numpy()
df_a[0,0]=99 # no error: with mixed dtypes, to_numpy() builds a new object array (a copy)
print(df_a) # the array is updated
print(df) # but the DataFrame is not

[[99 'a']
 [2 'b']
 [3 'c']]
   A  B
0  1  a
1  2  b
2  3  c
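The copy in the heterogeneous case comes from the dtype: a single NumPy array can hold only one dtype, so mixed columns fall back to object. The sketch below shows this and one possible way to keep a numeric dtype, by selecting the numeric columns first with select_dtypes(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

mixed = df.to_numpy()
print(mixed.dtype)  # object, because integers and strings share one array

# keeping only the numeric columns preserves a numeric dtype
numeric = df.select_dtypes(include="number").to_numpy()
print(numeric.dtype)
print(numeric.shape)
```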


In [84]:
# if a column of numeric data contains even one string, the dtype of the column becomes object
df=pd.DataFrame(
    {
        "A":[1,2,"three"],
        "B":[4,5,6]
    }
)
print(df["A"].dtype)
print(df["B"].dtype)

object
int64


In [99]:
s=pd.Series(
    pd.Categorical(
        ["a","b","c"]
    )
)
a1=s.values # sometimes a NumPy array, sometimes an ExtensionArray (so best avoided)
print(type(a1))
a2=s.to_numpy() # always gives a NumPy array (the preferred way)
print(type(a2))
a3=s.array # always gives an ExtensionArray
print(type(a3))

<class 'pandas.core.arrays.categorical.Categorical'>
<class 'numpy.ndarray'>
<class 'pandas.core.arrays.categorical.Categorical'>
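For comparison, the same three accessors can be tried on a plain integer Series, where .values does return a NumPy array. This is a small sketch; the exact class name printed for .array depends on the pandas version, so no output is shown for it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

print(type(s.values))      # a NumPy array for this dtype
print(type(s.to_numpy()))  # a NumPy array, as always
print(type(s.array))       # a thin ExtensionArray wrapper around the NumPy data
```

This is why .values is described as inconsistent: its return type depends on the dtype of the Series, while .to_numpy() and .array each give one kind of result.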
