# Practical 2

## Aim: Exploring NumPy and Pandas

### What is NumPy?

The **NumPy** module is a fundamental package for numerical computation in Python. It stands for "Numerical Python." It provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Here are some of the key features and benefits of using NumPy:

* **N-dimensional Array Object (`ndarray`):** This is the core of NumPy. It's a powerful and efficient way to store and manipulate homogeneous data (usually numbers). These arrays can have any number of dimensions (1D, 2D, 3D, etc.).
* **Broadcasting:** NumPy's broadcasting mechanism allows you to perform arithmetic operations on arrays with different shapes under certain conditions, making your code more concise and efficient.
* **Mathematical Functions:** NumPy provides a vast library of mathematical functions for array operations, including:
    * Linear algebra routines
    * Fourier transforms
    * Random number generation
    * Basic statistical functions
    * Trigonometric, exponential, and logarithmic functions, etc.
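* **Performance:** NumPy's core is implemented in C and stores array elements in contiguous, typed memory. Vectorized operations therefore avoid per-element Python overhead and are typically much faster than equivalent loops over standard Python lists, especially for large datasets.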
* **Integration with other Libraries:** NumPy arrays are the standard data structure used by many other scientific Python libraries like Pandas, SciPy, and scikit-learn.
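
The broadcasting and vectorization points above can be illustrated with a minimal sketch. The array values and the names `matrix`, `row`, and `col` are assumptions chosen only for illustration.

```python
import numpy as np

# A 3x3 matrix and a length-3 vector (illustrative values).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])

# Broadcasting: `row` (shape (3,)) is applied to every row of `matrix` (shape (3, 3)).
shifted = matrix + row

# A column vector of shape (3, 1) is applied to every column instead.
col = np.array([[100], [200], [300]])
shifted_cols = matrix + col

# Vectorized math functions work element-wise without an explicit Python loop.
roots = np.sqrt(matrix)

print(shifted)
print(shifted_cols)
print(roots.shape)
```

Broadcasting works here because the trailing dimensions either match or are 1; shapes that do not satisfy that rule raise a `ValueError`.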

### What is Pandas?

The **Pandas** module is a powerful and widely used Python library for data manipulation and analysis. It provides data structures that are specifically designed for working with structured data (like tables, spreadsheets, SQL databases, and time series).

The two primary data structures in Pandas are:

* **Series:** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). Think of it like a column in a spreadsheet or a labeled one-dimensional array.
* **DataFrame:** A two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table, where each column is a Series.
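
As a small sketch of how the two structures relate, the example below builds a Series and a DataFrame and then pulls one column back out of the DataFrame. The labels and values are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
scores = pd.Series([88, 92, 79], index=["alice", "bob", "carol"], name="score")

table = pd.DataFrame(
    {"score": [88, 92, 79], "passed": [True, True, False]},
    index=["alice", "bob", "carol"],
)

# Selecting one column of a DataFrame returns a Series that shares the row labels.
column = table["score"]
print(type(column).__name__)
print(column.index.tolist())

# Each column keeps its own dtype.
print(table.dtypes)
```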

Here are some of the key functionalities and benefits of using Pandas:

* **Data Handling:** Pandas makes it easy to read and write data from various file formats (CSV, Excel, SQL, etc.).
* **Data Cleaning and Preparation:** It provides tools for handling missing data (e.g., filling or dropping NaNs), filtering, sorting, and merging datasets.
* **Data Transformation:** You can easily reshape, pivot, and aggregate data.
* **Data Analysis:** Pandas integrates well with other scientific libraries like NumPy and Matplotlib, allowing for powerful data analysis and visualization.
* **Labeled Axes:** Both Series and DataFrames have labeled axes (row labels and column labels), which make it intuitive to access and manipulate data.
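
The following sketch strings several of these capabilities together: reading CSV data, filling a missing value, and aggregating by group. The CSV content and column names (`city`, `temp_c`) are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """city,temp_c
Pune,31.5
Delhi,
Pune,29.5
Delhi,35.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Cleaning: fill the missing temperature with the column mean.
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Aggregation: average temperature per city, warmest first.
summary = df.groupby("city")["temp_c"].mean().sort_values(ascending=False)
print(summary)
```

Filling with the column mean is only one possible strategy; dropping the row with `dropna()` or filling with a fixed value may suit other datasets better.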

## Difference between NumPy and Pandas

NumPy and Pandas are both essential Python libraries for data science, but they serve different primary purposes and have distinct characteristics. Here's a breakdown of their key differences:

**NumPy (Numerical Python):**

* **Primary Focus:** Numerical computation, especially with arrays and matrices.
* **Core Data Structure:** The **ndarray**, an n-dimensional array that efficiently stores homogeneous data (usually numbers).
* **Functionality:** Provides high-performance routines for mathematical operations on arrays (linear algebra, Fourier transforms, random numbers, etc.).
* **Data Type Flexibility:** Primarily designed for numerical data. An array can technically hold other types, but performance is optimized for numbers.
* **Structure:** Less emphasis on labels; data is primarily accessed by numerical index.
* **Use Cases:** Scientific computing, mathematical operations, the foundation for many other data science libraries.

**Pandas:**

* **Primary Focus:** Data manipulation and analysis of structured data.
* **Core Data Structures:**
    * **Series:** A one-dimensional labeled array.
    * **DataFrame:** A two-dimensional labeled table with columns of potentially different types.
* **Functionality:** Offers tools for data cleaning, transformation, merging, reshaping, and analysis. It excels at handling missing data and provides rich data manipulation capabilities.
* **Data Type Flexibility:** Designed to work with heterogeneous data (numbers, strings, dates, etc.) in a tabular format.
* **Structure:** Emphasizes labeled axes (row and column names/indices) for intuitive data access and alignment.
* **Use Cases:** Data analysis, data cleaning, working with tabular data (like spreadsheets, CSV files, SQL tables), time series analysis.
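
The difference in data type flexibility can be seen in a short sketch. The sample values and column names are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# NumPy picks one dtype for the whole array: mixing ints and a float
# upcasts every element to float64.
arr = np.array([1, 2, 3.5])
print(arr.dtype)

# An array can hold arbitrary Python objects with dtype=object,
# but it then loses the speed of typed numeric storage.
mixed = np.array([1, "two", 3.0], dtype=object)
print(mixed.dtype)

# A DataFrame keeps a separate dtype per column.
df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "height_m": [1.62, 1.80]})
print(df.dtypes)
```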

**Here's a table summarizing the key differences:**

| Feature             | NumPy                                    | Pandas                                         |
| :------------------ | :--------------------------------------- | :--------------------------------------------- |
| **Primary Goal** | Numerical computation                    | Data manipulation and analysis                 |
| **Core Structure** | `ndarray` (n-dimensional array)          | `Series` (1D labeled array), `DataFrame` (2D labeled table) |
| **Data Focus** | Primarily numerical, homogeneous in arrays | Heterogeneous, tabular data                  |
| **Labeling** | Primarily numerical indexing             | Labeled axes (rows and columns)               |
| **Missing Data** | `np.nan` for floats; NaN-aware functions such as `np.nanmean` | Built-in detection and handling (`isnull`, `fillna`, `dropna`) |
| **Use Cases** | Mathematical tasks, array operations     | Data cleaning, analysis, tabular data          |
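
Two rows of the table, missing data and labeling, can be sketched in code. The values and index labels below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])

# In NumPy, NaN propagates through ordinary reductions;
# the NaN-aware variant has to be chosen explicitly.
print(np.mean(values))      # nan
print(np.nanmean(values))   # 2.0

# pandas reductions skip NaN by default (skipna=True).
s = pd.Series(values)
print(s.mean())             # 2.0

# Label alignment: arithmetic matches on index labels, not on position.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])
print((left + right).tolist())   # [31, 22, 13]
```

With plain NumPy arrays the same addition would pair elements by position and give `[11, 22, 33]`.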

In [25]:
import numpy as np
import pandas as pd

In [26]:
print(np.__version__, pd.__version__)

2.2.2 2.2.3


# NumPy methods

In [27]:
np.array([1,2,3])

array([1, 2, 3])

In [28]:
np.zeros(3)

array([0., 0., 0.])

In [29]:
np.ones(3)

array([1., 1., 1.])

In [30]:
np.arange(1,4)

array([1, 2, 3])

In [31]:
np.linspace(1.0, 5.0, num=3)

array([1., 3., 5.])

In [32]:
a = np.array([[1,2],[3,4]])
a

array([[1, 2],
       [3, 4]])

In [33]:
np.reshape(a,4)

array([1, 2, 3, 4])

In [34]:
b = np.array([[1, 2], [3, 4]])
c = np.transpose(b)  # swap rows and columns (equivalent to b.T)
print("b =")
print(b)
print("c =")
print(c)

b =
[[1 2]
 [3 4]]
c =
[[1 3]
 [2 4]]


In [35]:
d = np.dot(a, b)
d

array([[ 7, 10],
       [15, 22]])

In [36]:
np.sum(a)

np.int64(10)

In [37]:
np.mean(b)

np.float64(2.5)

In [38]:
e = np.array([15,72,83,26,38])
np.std(e)

np.float64(26.331729908990027)

In [39]:
np.max(e)

np.int64(83)

In [40]:
np.min(e)

np.int64(15)

In [41]:
f = np.array([[14,67,37],[24,62,89]])
print(np.argmax(f,axis = 0))
print(np.argmax(f,axis = 1))

[1 0 1]
[1 2]


In [42]:
print(np.argmin(f,axis = 0))
print(np.argmin(f,axis = 1))

[0 1 0]
[0 0]


In [43]:
x = np.random.rand(2,2,3)
print(np.shape(x))
x

(2, 2, 3)


array([[[0.11938493, 0.08198745, 0.64535861],
        [0.84981   , 0.17021475, 0.23961744]],

       [[0.41232394, 0.72924139, 0.76731007],
        [0.92395858, 0.49470598, 0.79110353]]])

In [44]:
y = np.random.randn(2,2,2)
print(np.shape(y))
y

(2, 2, 2)


array([[[-0.20785335, -1.33286452],
        [ 0.31273768,  0.16743116]],

       [[-0.78963735,  0.01786618],
        [-0.46757581, -0.55217922]]])

In [45]:
B = np.random.randint(0, 16272, 8)
B

array([ 5154, 10507,  2975, 10308, 15820, 11866,  6886, 13636],
      dtype=int32)

In [46]:
np.ones_like(B)

array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [47]:
np.zeros_like(B)

array([0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [48]:
A = np.eye(3)
A

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [49]:
j = np.diag(B)
j

array([[ 5154,     0,     0,     0,     0,     0,     0,     0],
       [    0, 10507,     0,     0,     0,     0,     0,     0],
       [    0,     0,  2975,     0,     0,     0,     0,     0],
       [    0,     0,     0, 10308,     0,     0,     0,     0],
       [    0,     0,     0,     0, 15820,     0,     0,     0],
       [    0,     0,     0,     0,     0, 11866,     0,     0],
       [    0,     0,     0,     0,     0,     0,  6886,     0],
       [    0,     0,     0,     0,     0,     0,     0, 13636]],
      dtype=int32)

In [50]:
temp_1 = np.array([1,2,3,4])
temp_2 = np.array([5,6,7,8])

f = np.concatenate((temp_1,temp_2), axis=0)
f

array([1, 2, 3, 4, 5, 6, 7, 8])

In [51]:
g = np.split(f,2,axis=0)
g

[array([1, 2, 3, 4]), array([5, 6, 7, 8])]

In [52]:
np.random.choice(11,8,replace = False)

array([ 2,  7,  1,  5,  8, 10,  6,  0], dtype=int32)
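
The random outputs above change on every run. If repeatable results are wanted, one option is NumPy's `Generator` API with a fixed seed. This is a sketch of an alternative to the legacy `np.random.*` calls used in this practical, not a replacement for them; the seed value 42 is arbitrary.

```python
import numpy as np

# Any fixed integer seed makes the draws repeatable; 42 is arbitrary.
rng = np.random.default_rng(42)

uniform = rng.random((2, 2, 3))              # counterpart of np.random.rand(2, 2, 3)
normal = rng.standard_normal((2, 2, 2))      # counterpart of np.random.randn(2, 2, 2)
integers = rng.integers(0, 16272, size=8)    # counterpart of np.random.randint(0, 16272, 8)
sample = rng.choice(11, size=8, replace=False)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
assert np.array_equal(uniform, rng_again.random((2, 2, 3)))

print(uniform.shape, normal.shape, integers.shape, sample.shape)
```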

In [53]:
np.where(f < 6,f ,5*f)

array([ 1,  2,  3,  4,  5, 30, 35, 40])

In [54]:
np.unique([1,2,1,4,3,5,2,3],return_counts=False)

array([1, 2, 3, 4, 5])

In [55]:
A = np.array([1, 2, 3])
B = np.array([4,5,6])

In [56]:
np.vstack((A,B))

array([[1, 2, 3],
       [4, 5, 6]])

In [57]:
np.hstack((A,B))

array([1, 2, 3, 4, 5, 6])

In [58]:
C = np.array([[10, 20], [30, 40]])
np.linalg.inv(C)

array([[-0.2 ,  0.1 ],
       [ 0.15, -0.05]])

# Pandas methods

In [59]:
series = pd.Series([1, 2, 3,4 ,5 ,6])
series

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [60]:
data = {'A': [1, 2], 'B': [3, 4]}
df = pd.DataFrame(data)
df

   A  B
0  1  3
1  2  4


In [61]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.head(2)

   A  B
0  1  4
1  2  5


In [62]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.tail(2)

   A  B
1  2  5
2  3  6


In [63]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


In [64]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.describe()

         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


In [65]:
df.shape

(3, 2)

In [66]:
df.drop('A',axis=1)

   B
0  4
1  5
2  6


In [67]:
# Renaming 'B' to 'A' leaves two columns sharing the label 'A'; pandas permits duplicate labels
df.rename(columns={'B': 'A'})

   A  A
0  1  4
1  2  5
2  3  6


In [68]:
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
df.isnull()

       A      B
0  False  False
1   True  False
2  False   True


In [69]:
df.notnull()

       A      B
0   True   True
1  False   True
2   True  False


In [70]:
df.fillna(2)

     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  2.0


In [71]:
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
df.dropna()

     A    B
0  1.0  4.0


In [72]:
df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5]})
grouped = df.groupby('A').sum()
grouped

   B
A
1  7
2  5


In [73]:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'C': [5, 6]})
merged = pd.merge(df1, df2, on='A')
merged

   A  B  C
0  1  3  5
1  2  4  6


In [74]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
concatenated = pd.concat([df1, df2])
concatenated

   A
0  1
1  2
0  3
1  4
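
Note that the concatenated result repeats the row labels 0 and 1, because each frame keeps its own index. A minimal sketch of one way to renumber the rows, using the `ignore_index` parameter of `pd.concat` (the names `stacked` and `renumbered` are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Default: each frame keeps its own row labels, so 0 and 1 appear twice.
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and assigns 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Duplicate labels are legal in pandas, but label-based lookups such as `stacked.loc[0]` then return more than one row.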


In [75]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['X', 'X', 'Y']})
pivot = df.pivot_table(values='B', index='C', aggfunc='sum')
pivot

   B
C
X  9
Y  6


In [76]:
df

   A  B  C
0  1  4  X
1  2  5  X
2  3  6  Y


In [77]:
df.apply(lambda x: x * 2)

   A   B   C
0  2   8  XX
1  4  10  XX
2  6  12  YY


In [78]:
df.iloc[1]

A    2
B    5
C    X
Name: 1, dtype: object

In [79]:
df.loc[1,'A']

np.int64(2)

In [80]:
df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 5, 4]})
df.sort_values(by='A')

   A  B
1  1  5
2  2  4
0  3  6


In [81]:
df.sort_index()

   A  B
0  3  6
1  1  5
2  2  4


In [82]:
df.corr()

     A    B
A  1.0  0.5
B  0.5  1.0


In [83]:
df.cumsum()

   A   B
0  3   6
1  4  11
2  6  15


In [84]:
df.value_counts()

A  B
1  5    1
2  4    1
3  6    1
Name: count, dtype: int64

In [85]:
fruits = pd.Series(['apple', 'banana', 'cherry'])  # a Series, so named accordingly rather than `df`
fruits.str.contains('a')

0     True
1     True
2    False
dtype: bool

In [86]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
df.to_csv('output.csv')

In [87]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
df.to_excel('output.xlsx')

In [88]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
df.to_json('output.json')  # JSON output belongs in a .json file, not .xlsx
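
To check that exported data can be read back, a round trip can be sketched as below. An in-memory buffer stands in for `output.csv` so the example leaves no files behind; the small frame and the names `naive` and `restored` are assumptions for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# io.StringIO stands in for a file such as 'output.csv'.
buffer = io.StringIO()
df.to_csv(buffer)            # the row index is written as an unnamed first column
buffer.seek(0)

# Without index_col, the saved index comes back as a column named 'Unnamed: 0'.
naive = pd.read_csv(buffer)
print(naive.columns.tolist())      # ['Unnamed: 0', 'A', 'B']

# index_col=0 restores the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())   # ['A', 'B']
```

Passing `index=False` to `to_csv` is another option when the row labels carry no information.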

### Extra pandas commands

In [89]:
import pandas as pd

In [90]:
pd.__version__

'2.2.3'

In [91]:
data = [10, 20, 30]
index = ['a', 'b', 'c']
series = pd.Series(data, index=index)
series

a    10
b    20
c    30
dtype: int64

In [92]:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
df

    Name  Age
0  Alice   25
1    Bob   30


In [93]:
df.head()

    Name  Age
0  Alice   25
1    Bob   30


In [94]:
df.tail()

    Name  Age
0  Alice   25
1    Bob   30


In [95]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2 non-null      object
 1   Age     2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 164.0+ bytes


In [96]:
df.describe()

             Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000


In [97]:
data = {'Category': ['A', 'B', 'A', 'B'], 'Value': [10, 20, 30, 40]}
df = pd.DataFrame(data)
df.groupby('Category').sum()

          Value
Category
A            40
B            60


In [98]:
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Score': [90, 85]})
merged_df = pd.merge(df1, df2, on='ID')
merged_df

   ID   Name  Score
0   1  Alice     90
1   2    Bob     85


In [99]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
concatenated_df = pd.concat([df1, df2])
concatenated_df

   A
0  1
1  2
0  3
1  4


In [100]:
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]}, index=['a', 'b', 'c'])
df.loc['b', 'A']

np.int64(20)

In [101]:
df.query('A >15')

    A   B
b  20  50
c  30  60


In [102]:
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
df_cleaned = df.dropna()
df_cleaned

     A    B
0  1.0  4.0


In [103]:
df.fillna(0)

     A    B
0  1.0  4.0
1  0.0  5.0
2  3.0  0.0


In [104]:
df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]})
df_no_duplicates = df.drop_duplicates()
df_no_duplicates

   A  B
0  1  4
1  2  5
3  3  6


In [105]:
df.rename(columns={'A': 'Alpha', 'B': 'Beta'})

   Alpha  Beta
0      1     4
1      2     5
2      2     5
3      3     6


In [106]:
df = pd.DataFrame({'ID': [1, 2], 'Math': [90, 85], 'Science': [88, 92]})
df_melted = pd.melt(df, id_vars=['ID'], var_name='Subject', value_name='Score')
df_melted

   ID  Subject  Score
0   1     Math     90
1   2     Math     85
2   1  Science     88
3   2  Science     92


In [107]:
df_melted.pivot(index='ID', columns='Subject', values='Score')

Subject  Math  Science
ID
1          90       88
2          85       92


In [108]:
df = pd.DataFrame({'A': [10, 20, 30,40, 50, 60], 'B': [40, 50, 60,70,80,90]})
df.mean()

A    35.0
B    65.0
dtype: float64

In [109]:
df.median()

A    35.0
B    65.0
dtype: float64

In [110]:
df.std()

A    18.708287
B    18.708287
dtype: float64

In [111]:
df.corr()

     A    B
A  1.0  1.0
B  1.0  1.0
