# Pandas Series
A Series is a one-dimensional (1D) labeled array capable of holding any data type: int, str, float, Python objects, etc.

### Creating Series using Python list or dict

In [5]:
import pandas as pd
import numpy as np

s = pd.Series([1,2,3,4,5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [6]:
s = pd.Series(['Xtern', 'abc', '123'])
s

0    Xtern
1      abc
2      123
dtype: object

In [8]:
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
s

a    1
b    2
c    3
dtype: int64

### Creating Series from numpy ndarray

In [11]:
data = np.array([10, 40, 45, 67, 78])

series = pd.Series(data)
series

0    10
1    40
2    45
3    67
4    78
dtype: Int64

In [31]:
data = np.random.randint(10, 50, size = (2,3))

series = pd.Series(data)
series

ValueError: Data must be 1-dimensional, got ndarray of shape (2, 3) instead

#### We get a ValueError because we are trying to put a 2D array into a Series, which is a 1D data structure in pandas
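If the 2D data is still needed, two possible ways out are to flatten it before building the Series, or to build a DataFrame instead. The sketch below assumes a small fixed array (np.arange reshaped to 2 x 3) rather than the random one above, so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# A small, fixed 2D array (2 rows x 3 columns), used only for illustration
data = np.arange(6).reshape(2, 3)

# Option 1: flatten to 1D first, then build the Series
flat_series = pd.Series(data.ravel())

# Option 2: keep both dimensions by building a DataFrame instead
frame = pd.DataFrame(data)

print(flat_series.shape)  # (6,)
print(frame.shape)        # (2, 3)
```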

### Creating Series from scalar

In [13]:
index = ['a', 'b', 'c', 'd']

pd.Series(5.5, index = index)

a    5.5
b    5.5
c    5.5
d    5.5
dtype: float64

### Accessing properties/attributes and methods of Series

In [14]:
data = np.array([10, 23, 45, 67, 89])

series = pd.Series(data)
series

0    10
1    23
2    45
3    67
4    89
dtype: Int64

#### Attributes

In [15]:
print('Data type:', series.dtype, '\n')
print('Shape:', series.shape, '\n')
print('Values:', series.values, '\n')
print('Array:', series.array, '\n')

Data type: Int64 

Shape: (5,) 

Values: <IntegerArray>
[10, 23, 45, 67, 89]
Length: 5, dtype: Int64 

Array: <IntegerArray>
[10, 23, 45, 67, 89]
Length: 5, dtype: Int64 



#### Methods

In [16]:
# to extract back to numpy array

to_numpy = series.to_numpy()
to_numpy

array([10, 23, 45, 67, 89])

In [17]:
series.head()

0    10
1    23
2    45
3    67
4    89
dtype: Int64

In [18]:
series.tail()

0    10
1    23
2    45
3    67
4    89
dtype: Int64

In [20]:
series.info()

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      Int64
dtypes: Int64(1)
memory usage: 173.0 bytes


### Accessing Data Using Indexing and Slicing

In [21]:
s = pd.Series([1, 2, 3, 4, 5, 6])

print(s[2])

3


In [22]:
print(s[1:])

print(s[1:4])

1    2
2    3
3    4
4    5
5    6
dtype: int64
1    2
2    3
3    4
dtype: int64


In [23]:
# to retrieve multiple elements, pass a list of index labels

print(s[[2, 4]])

2    3
4    5
dtype: int64


In [24]:
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index = index)
s

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64

In [25]:
print(s['a'])

1


In [26]:
# to retrieve multiple elements, pass a list of index labels

print(s[['a', 'c', 'e']])

a    1
c    3
e    5
dtype: int64
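One detail worth keeping in mind when slicing a labeled Series: a slice by label includes the end label, while a slice by position excludes the end position. A minimal sketch on the same Series, using the explicit .loc and .iloc accessors to make the intent clear:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index=list('abcdefg'))

# Label-based slice: both 'b' and 'd' are included
by_label = s.loc['b':'d']

# Position-based slice: position 1 is included, position 3 is excluded
by_position = s.iloc[1:3]

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```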


In [27]:
print(s['h'])

KeyError: 'h'

The label 'h' is not in the index, which is why we get a KeyError. With Series.get(), a missing label returns None, or a default value of your choice, instead of raising an error.

In [28]:
print(s.get('h'))

None


In [32]:
# OR

print(s.get('h', np.nan))

nan
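The three ways of dealing with a possibly missing label can be put side by side. This is only a sketch on a small made-up Series; the default value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Membership tests on a Series check the index labels, not the values
has_h = 'h' in s

# get() returns the default instead of raising
value = s.get('h', default=0)

# Bracket access raises, so it can be wrapped in try/except when needed
try:
    s['h']
    raised = False
except KeyError:
    raised = True

print(has_h, value, raised)  # False 0 True
```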


## Uses of Pandas Series
### Store One-Dimensional Data:

- It’s great for storing a single column of data like a list of names, numbers, dates, etc.

### Labeling Data with Index:

- You can assign custom labels (index) to each value. This is helpful for referencing or slicing data by name rather than position.

### Data Manipulation:

- Easy to apply operations (like filtering, arithmetic, etc.) across all elements.

- Built-in vectorized operations, making it more efficient than regular Python lists.

### Missing Data Handling:

- Can handle NaN (Not a Number) values gracefully, useful for real-world data.

### Time Series Data:

- Ideal for indexing and working with time series data.

### Intermediate Step in Data Analysis:

- Often used when working with or extracting columns from a DataFrame to perform operations before assigning results back.
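A few of the uses listed above (vectorized arithmetic, filtering, and missing-data handling) can be sketched on a small made-up Series. The labels and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values; the entry for 'b' is deliberately missing
prices = pd.Series([10.0, np.nan, 30.0], index=['a', 'b', 'c'])

doubled = prices * 2                   # vectorized arithmetic, no explicit loop
above_15 = prices[prices > 15]         # boolean filtering
filled = prices.fillna(prices.mean())  # mean() skips NaN by default

print(filled.tolist())  # [10.0, 20.0, 30.0]
```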

## When to Use Pandas Series
- When you're working with one-dimensional data (like a list or a column from a DataFrame).

- When you need label-based indexing for elements.

- When you want to perform vectorized operations on a 1D dataset.

- When you're preparing or cleaning a single column of a larger dataset.

- When building or analyzing features for machine learning one column at a time.

- When handling time-indexed data (like stock prices or logs).
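As a rough illustration of the last point, a Series indexed by dates can be sliced with date strings. The dates and readings below are invented for the example.

```python
import pandas as pd

# Four consecutive daily timestamps as the index
dates = pd.date_range('2024-01-01', periods=4, freq='D')
readings = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# Label-based slicing on dates includes both endpoints
window = readings.loc['2024-01-02':'2024-01-03']

print(window.tolist())  # [2.0, 3.0]
```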

# Pandas DataFrame

- It is a 2D labeled data structure
- It is value-mutable and size-mutable
- It is a tabular structure with heterogeneously typed columns
- In pandas, a data table is called a DataFrame, and each column of it is a Series

### Syntax for creating a Pandas DataFrame

In [33]:
df = pd.DataFrame(data, index=idx, columns=cols)

NameError: name 'idx' is not defined

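Here data can be many different things, e.g., a Python dict, list, or tuple. The NameError above appears because idx and cols are only placeholders in this template and were never defined.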
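For the constructor call to run, data, idx and cols all have to exist first. A minimal sketch with made-up values (the row and column labels are hypothetical):

```python
import pandas as pd

data = [[1, 2], [3, 4]]  # two rows, two columns
idx = ['r1', 'r2']       # one label per row
cols = ['x', 'y']        # one label per column

df = pd.DataFrame(data, index=idx, columns=cols)

print(df.loc['r2', 'y'])  # 4
```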

### Creating a dataframe using python dictionary

In [34]:
data = {
    'Name': ['Ann', 'Jane', 'Xavier', 'Justina'],
    'Age': [40, 46, 58, 67],
    'Gender': ["Female", "Female", "Male", "Female"]
}

df = pd.DataFrame(data)

df

|   | Name | Age | Gender |
|---|------|-----|--------|
| 0 | Ann | 40 | Female |
| 1 | Jane | 46 | Female |
| 2 | Xavier | 58 | Male |
| 3 | Justina | 67 | Female |


### Creating a df using tuple or lists

### Tuples in a list

In [39]:
data = [
   ('1/1/2019', 23, 45, 'Rain'),
   ('2/3/2019', 45, 12, 'Fog'),
   ('1/4/2019', 11, 34, 'Bright'),
   ('3/4/2019', 34, 56, 'Storm')
]

df = pd.DataFrame(data)

df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1/1/2019 | 23 | 45 | Rain |
| 1 | 2/3/2019 | 45 | 12 | Fog |
| 2 | 1/4/2019 | 11 | 34 | Bright |
| 3 | 3/4/2019 | 34 | 56 | Storm |


### Creating a df using tuple or lists and indicating column names

### Tuples in a tuple

In [38]:

data = (
   ('1/1/2019', 23, 45, 'Rain'),
   ('2/3/2019', 45, 12, 'Fog'),
   ('1/4/2019', 11, 34, 'Bright'),
   ('3/4/2019', 34, 56, 'Storm')
)

df = pd.DataFrame(data, columns=['Date', 'Temperature', 'Windspeed', 'Event'])

df

|   | Date | Temperature | Windspeed | Event |
|---|------|-------------|-----------|-------|
| 0 | 1/1/2019 | 23 | 45 | Rain |
| 1 | 2/3/2019 | 45 | 12 | Fog |
| 2 | 1/4/2019 | 11 | 34 | Bright |
| 3 | 3/4/2019 | 34 | 56 | Storm |


### Creating a df using tuple or lists, indicating indexes and column names

### Lists in a Tuple

In [37]:
data = (
    ['1/1/2019', 23, 45, 'Rain'],
    ['2/3/2019', 45, 12, 'Fog'],
    ['1/4/2019', 11, 34, 'Bright'],
    ['3/4/2019', 34, 56, 'Storm']
)

df = pd.DataFrame(data, index=['T1', 'T2', 'T3', 'T4'], columns=['Date', 'Temperature', 'Windspeed', 'Event'])

df

|   | Date | Temperature | Windspeed | Event |
|---|------|-------------|-----------|-------|
| T1 | 1/1/2019 | 23 | 45 | Rain |
| T2 | 2/3/2019 | 45 | 12 | Fog |
| T3 | 1/4/2019 | 11 | 34 | Bright |
| T4 | 3/4/2019 | 34 | 56 | Storm |


### Creating dataframe using numpy array

In [40]:
arr = np.random.randint(100, 201, size=(1000, 100))
arr

array([[179, 105, 111, ..., 154, 174, 101],
       [114, 127, 197, ..., 136, 135, 184],
       [194, 198, 167, ..., 197, 138, 102],
       ...,
       [159, 105, 166, ..., 193, 125, 195],
       [155, 180, 142, ..., 158, 183, 136],
       [148, 175, 159, ..., 126, 139, 137]], dtype=int32)

### Converting to a dataframe and adding column names to the dataframe

In [42]:
df = pd.DataFrame(arr, columns=['col_'+str(i) for i in range(1, 101)])

df

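|   | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 | col_10 | ... | col_91 | col_92 | col_93 | col_94 | col_95 | col_96 | col_97 | col_98 | col_99 | col_100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 179 | 105 | 111 | 123 | 152 | 170 | 192 | 155 | 129 | 200 | ... | 104 | 125 | 168 | 145 | 164 | 111 | 196 | 154 | 174 | 101 |
| 1 | 114 | 127 | 197 | 119 | 118 | 156 | 182 | 172 | 117 | 176 | ... | 152 | 154 | 198 | 143 | 108 | 143 | 113 | 136 | 135 | 184 |
| 2 | 194 | 198 | 167 | 142 | 144 | 152 | 180 | 137 | 147 | 105 | ... | 186 | 175 | 133 | 131 | 140 | 172 | 175 | 197 | 138 | 102 |
| 3 | 123 | 119 | 191 | 142 | 157 | 119 | 129 | 160 | 124 | 195 | ... | 107 | 138 | 191 | 149 | 182 | 158 | 147 | 165 | 189 | 160 |
| 4 | 156 | 196 | 178 | 167 | 113 | 103 | 146 | 103 | 177 | 200 | ... | 167 | 152 | 160 | 145 | 150 | 168 | 140 | 139 | 101 | 116 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 127 | 108 | 187 | 149 | 149 | 101 | 140 | 108 | 184 | 182 | ... | 135 | 119 | 180 | 144 | 183 | 130 | 118 | 139 | 134 | 110 |
| 996 | 112 | 193 | 105 | 151 | 195 | 126 | 165 | 133 | 142 | 162 | ... | 199 | 116 | 111 | 139 | 134 | 191 | 132 | 121 | 112 | 178 |
| 997 | 159 | 105 | 166 | 195 | 180 | 121 | 110 | 169 | 121 | 119 | ... | 141 | 197 | 168 | 145 | 104 | 165 | 129 | 193 | 125 | 195 |
| 998 | 155 | 180 | 142 | 157 | 179 | 167 | 117 | 122 | 198 | 171 | ... | 174 | 127 | 145 | 179 | 158 | 194 | 138 | 158 | 183 | 136 |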


### Accessing Attributes/Properties and Methods of DataFrame

In [43]:
# Creating dictionary of series
data = {
   'Name': pd.Series(['Neymar', 'Muse', 'Abu', 'Shalom', 'Jacky']),
   'Age': pd.Series([34, 56, 78, 23, 12]),
   'Rating': pd.Series([3.4, 5.0, 2.3, 1.0, np.nan])
}

df = pd.DataFrame(data)
df

|   | Name | Age | Rating |
|---|------|-----|--------|
| 0 | Neymar | 34 | 3.4 |
| 1 | Muse | 56 | 5.0 |
| 2 | Abu | 78 | 2.3 |
| 3 | Shalom | 23 | 1.0 |
| 4 | Jacky | 12 | NaN |


In [46]:
# Attributes

print('Shape of the df: ', df.shape, '\n')
print('Names of the columns: ', df.columns, '\n')
print('Data types for each column: \n', df.dtypes, '\n')
print('Axes: \n', df.axes, '\n')
print('Returning data as a numpy array: \n', df.values, '\n')

Shape of the df:  (5, 3) 

Names of the columns:  Index(['Name', 'Age', 'Rating'], dtype='object') 

Data types for each column: 
 Name       object
Age         int64
Rating    float64
dtype: object 

Axes: 
 [RangeIndex(start=0, stop=5, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')] 

Returning data as a numpy array: 
 [['Neymar' 34 3.4]
 ['Muse' 56 5.0]
 ['Abu' 78 2.3]
 ['Shalom' 23 1.0]
 ['Jacky' 12 nan]] 



In [47]:
# for more technical information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     5 non-null      int64  
 2   Rating  4 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes


This method gives us the following information:
- The kind of data structure
- The number of entries/rows
- The row labels (index), ranging from 0 to 4 in this case
- The number of columns and the count of non-null values in each column
- The number of columns of each data type
- The approximate amount of RAM used to hold the DataFrame
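info() only prints its report. When the same facts are needed as values in code, they can be collected with other attributes and methods. The sketch below rebuilds the small frame from plain lists (an assumption made to keep the example self-contained) and is not the only way to do this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Neymar', 'Muse', 'Abu', 'Shalom', 'Jacky'],
    'Age': [34, 56, 78, 23, 12],
    'Rating': [3.4, 5.0, 2.3, 1.0, np.nan],
})

n_rows, n_cols = df.shape                       # number of rows and columns
non_null = df.notna().sum()                     # non-null count per column
dtype_counts = df.dtypes.value_counts()         # number of columns per dtype
total_bytes = df.memory_usage(deep=True).sum()  # approximate memory in bytes

print(n_rows, n_cols)      # 5 3
print(non_null['Rating'])  # 4
```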

In [48]:
df.head()

|   | Name | Age | Rating |
|---|------|-----|--------|
| 0 | Neymar | 34 | 3.4 |
| 1 | Muse | 56 | 5.0 |
| 2 | Abu | 78 | 2.3 |
| 3 | Shalom | 23 | 1.0 |
| 4 | Jacky | 12 | NaN |


In [50]:
df.tail(2)

|   | Name | Age | Rating |
|---|------|-----|--------|
| 3 | Shalom | 23 | 1.0 |
| 4 | Jacky | 12 | NaN |


### Working With Tabular Data

Pandas integrates with many file formats and data sources such as CSV, Excel, SQL, JSON, and Parquet. To get data into pandas, use the functions with the prefix read_* (for example read_csv). To export data out of pandas, use the methods with the prefix to_* (for example to_csv).

In [52]:
data = {
        'Name': pd.Series(['Neymar', 'Muse', 'Abu', 'Shalom', 'Jacky']),
        'Age': pd.Series([34, 56, 78,23,12]),
        'Rating': pd.Series([3.4, 5.0, 2.3, 1.0, np.nan])
}

df = pd.DataFrame(data)
df

|   | Name | Age | Rating |
|---|------|-----|--------|
| 0 | Neymar | 34 | 3.4 |
| 1 | Muse | 56 | 5.0 |
| 2 | Abu | 78 | 2.3 |
| 3 | Shalom | 23 | 1.0 |
| 4 | Jacky | 12 | NaN |


In [53]:
# Writing dataframe to csv

df.to_csv('csv1.csv')

In [54]:
# Writing dataframe to csv without index

df.to_csv('csv2.csv', index=False)
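The difference between the two calls shows up when the file is read back. The sketch below writes to in-memory strings instead of files (an assumption made so that it runs without touching the disk) and uses a small made-up frame.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Jane'], 'Age': [40, 46]})

# With no path given, to_csv() returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The saved index has no header, so read_csv names that column 'Unnamed: 0'
back_default = pd.read_csv(io.StringIO(with_index))

# index_col=0 tells read_csv to use the first column as the index again
back_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(back_default.columns))  # ['Unnamed: 0', 'Name', 'Age']
print(list(back_indexed.columns))  # ['Name', 'Age']
```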

In [58]:
# Writing dataframe to xlsx

df.to_excel('excel1.xl

ModuleNotFoundError: No module named 'openpyxl'

In [56]:
!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable


In [57]:
help(df.to_excel)

Help on method to_excel in module pandas.core.generic:

to_excel(excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter', *, sheet_name: 'str' = 'Sheet1', na_rep: 'str' = '', float_format: 'str | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'Sequence[Hashable] | bool_t' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, startrow: 'int' = 0, startcol: 'int' = 0, engine: "Literal['openpyxl', 'xlsxwriter'] | None" = None, merge_cells: 'bool_t' = True, inf_rep: 'str' = 'inf', freeze_panes: 'tuple[int, int] | None' = None, storage_options: 'StorageOptions | None' = None, engine_kwargs: 'dict[str, Any] | None' = None) -> 'None' method of pandas.core.frame.DataFrame instance
    Write object to an Excel sheet.
    
    To write a single object to an Excel .xlsx file it is only necessary to
    specify a target file name. To write to multiple sheets it is necessary to
    create an `ExcelWriter` object with a target file name, and specify a sheet
 

In [59]:
# Writing dataframe to excel without index

df.to_excel('excel1.xlsx', sheet_name='stud_data', index=False)

ModuleNotFoundError: No module named 'openpyxl'

### Reading .xlsx file

In [60]:
import pandas as pd

df = pd.read_excel('excel2.xlsx')

FileNotFoundError: [Errno 2] No such file or directory: 'excel2.xlsx'

### Read .csv Files - Iris DataSet

In [63]:
import pandas as pd

df = pd.read_csv('Iris.csv')

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [65]:
df.head()

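|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |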


In [66]:
df.tail(10)

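|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 140 | 141 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 142 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 143 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 144 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 145 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |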


In [67]:
df.sum()

Id                                                           11325
SepalLengthCm                                                876.5
SepalWidthCm                                                 458.1
PetalLengthCm                                                563.8
PetalWidthCm                                                 179.8
Species          Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object

In [70]:
df.sum(axis=1)

TypeError: unsupported operand type(s) for +: 'float' and 'str'

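With axis=1 the sum runs across the columns of each row, so pandas tries to add the float measurements to the string in the Species column, which fails. Let's fix this by selecting only the numeric measurement columns.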

In [71]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].sum(axis=1)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64
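An alternative to listing the columns by hand is the numeric_only parameter of sum(). The sketch below uses a two-row frame with made-up values rather than the Iris file. Note that on the real data this would also add the Id column, which the explicit column list above leaves out.

```python
import pandas as pd

df = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9],
    'PetalLengthCm': [1.4, 1.4],
    'Species': ['Iris-setosa', 'Iris-setosa'],
})

# Non-numeric columns such as Species are skipped instead of raising
row_totals = df.sum(axis=1, numeric_only=True)

print(row_totals.round(1).tolist())  # [6.5, 6.3]
```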

Other methods include:
- min() and max()
- mean(), median(), var(), and std()
- count(), nunique(), unique(), and value_counts() for categorical columns
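A quick sketch of these methods on two tiny made-up Series, one numeric and one categorical. The values are chosen only so that the results are easy to check by hand.

```python
import pandas as pd

lengths = pd.Series([1.0, 2.0, 6.0])
species = pd.Series(['setosa', 'setosa', 'virginica'])

print(lengths.min(), lengths.max())        # 1.0 6.0
print(lengths.mean(), lengths.median())    # 3.0 2.0
print(species.count(), species.nunique())  # 3 2

# unique() lists the distinct values, value_counts() counts each of them
distinct = list(species.unique())
counts = species.value_counts()
```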

In [72]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

### Creating a new dataframe consisting only of float datatype

In [73]:
num_cols = df.select_dtypes(include=['float64']).columns

num_cols

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')

In [75]:
# subsetting

df[num_cols].median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
dtype: float64

In [76]:
# subsetting

df[num_cols].std()

SepalLengthCm    0.828066
SepalWidthCm     0.433594
PetalLengthCm    1.764420
PetalWidthCm     0.763161
dtype: float64

In [79]:
df['Species'].count()

np.int64(150)

In [80]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

print(df['Species'].unique())
print(df['Species'].nunique())

### Using describe() to Summarize the Data

In [82]:
# summarizes all the numerical columns

df.describe()

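|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|----|---------------|--------------|---------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |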


In [83]:
df.describe(include=['object'])

|   | Species |
|---|---------|
| count | 150 |
| unique | 3 |
| top | Iris-setosa |
| freq | 50 |


In [84]:
df.describe(include=['number'])

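|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|----|---------------|--------------|---------------|--------------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |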


In [85]:
# for all the columns

df.describe(include='all')

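|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | NaN | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | NaN | Iris-setosa |
| freq | NaN | NaN | NaN | NaN | NaN | 50 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 | NaN |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 | NaN |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 | NaN |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 | NaN |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 | NaN |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 | NaN |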


In [86]:
df.corr()

ValueError: could not convert string to float: 'Iris-setosa'

In [87]:
df.skew()

TypeError: could not convert string to float: 'Iris-setosa'

In [88]:
df.kurt()

TypeError: could not convert string to float: 'Iris-setosa'

#### These errors (one ValueError and two TypeErrors) tell us that we need to select only valid columns, i.e., numerical columns, in order to use these functions.
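Two possible ways to restrict the computation to numerical columns are sketched below on a small made-up frame: selecting the columns first with select_dtypes(include='number'), or passing numeric_only=True to the method itself. The column names a, b and label are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.5],
    'label': ['x', 'y', 'x', 'y'],
})

# Way 1: keep only the numeric columns, then compute
numeric = df.select_dtypes(include='number')
corr_selected = numeric.corr()

# Way 2: let the method skip the non-numeric columns
corr_flag = df.corr(numeric_only=True)

print(list(corr_selected.columns))  # ['a', 'b']
```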

In [89]:
# Remember the num_cols we created earlier? Let's create something like that

num_cols = df.select_dtypes(include=['float64']).columns

In [90]:
df[num_cols].kurt()

SepalLengthCm   -0.552064
SepalWidthCm     0.290781
PetalLengthCm   -1.401921
PetalWidthCm    -1.339754
dtype: float64

In [91]:
df[num_cols].skew()

SepalLengthCm    0.314911
SepalWidthCm     0.334053
PetalLengthCm   -0.274464
PetalWidthCm    -0.104997
dtype: float64

In [92]:
df[num_cols].corr()

|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---------------|--------------|---------------|--------------|
| SepalLengthCm | 1.0 | -0.109369 | 0.871754 | 0.817954 |
| SepalWidthCm | -0.109369 | 1.0 | -0.420516 | -0.356544 |
| PetalLengthCm | 0.871754 | -0.420516 | 1.0 | 0.962757 |
| PetalWidthCm | 0.817954 | -0.356544 | 0.962757 | 1.0 |


### DataFrame.agg() Method

In [93]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [95]:
df.agg(
    {
      'SepalLengthCm': ['min', 'max', 'count'],
      'PetalLengthCm': ['min', 'max', 'mean', 'count'],
      'Species': ['count']
   }
)

|   | SepalLengthCm | PetalLengthCm | Species |
|---|---------------|---------------|---------|
| min | 4.3 | 1.0 | NaN |
| max | 7.9 | 6.9 | NaN |
| count | 150.0 | 150.0 | 150.0 |
| mean | NaN | 3.758667 | NaN |
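The empty (NaN) cells in the result appear wherever a function was not requested for a column: each column only gets the functions listed for it in the dictionary. A minimal sketch on a made-up frame, with hypothetical column names length and kind:

```python
import pandas as pd

df = pd.DataFrame({
    'length': [1.0, 2.0, 3.0],
    'kind': ['a', 'b', 'a'],
})

# Each key is a column, each value the list of functions to apply to it
result = df.agg({
    'length': ['min', 'max', 'mean'],
    'kind': ['count'],
})

print(result.loc['mean', 'length'])          # 2.0
print(pd.isna(result.loc['min', 'kind']))    # True
```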


### Accessing Data in a DataFrame Using Indexing and Slicing

- When selecting subsets of data, square brackets [] are used
- Inside these brackets you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression, or a colon.
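The main forms of [] selection can be sketched on a three-row made-up frame. The column names temp and rain are hypothetical and stand in for any real columns.

```python
import pandas as pd

df = pd.DataFrame({'temp': [42, 95, 60], 'rain': [0.0, 0.1, 0.0]})

one_column = df['temp']             # single label -> Series
two_columns = df[['temp', 'rain']]  # list of labels -> DataFrame
first_rows = df[0:2]                # slice -> rows 0 and 1
hot_rows = df[df['temp'] > 90]      # condition -> rows where it is True

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(hot_rows.index.tolist())     # [1]
```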


### Reading the .csv File - Weather DataSet

In [6]:
df = pd.read_csv('nyc_weather2.csv')

df

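|   | date | maximum temperature | minimum temperature | average temperature | precipitation | snow fall | snow depth |
|---|------|---------------------|---------------------|---------------------|---------------|-----------|------------|
| 0 | 1-1-2016 | 42 | 34 | 38.0 | 0.00 | 0.0 | 0 |
| 1 | 2-1-2016 | 40 | 32 | 36.0 | 0.00 | 0.0 | 0 |
| 2 | 3-1-2016 | 45 | 35 | 40.0 | 0.00 | 0.0 | 0 |
| 3 | 4-1-2016 | 36 | 14 | 25.0 | 0.00 | 0.0 | 0 |
| 4 | 5-1-2016 | 29 | 11 | 20.0 | 0.00 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 361 | 27-12-2016 | 60 | 40 | 50.0 | 0 | 0 | 0 |
| 362 | 28-12-2016 | 40 | 34 | 37.0 | 0 | 0 | 0 |
| 363 | 29-12-2016 | 46 | 33 | 39.5 | 0.39 | 0 | 0 |
| 364 | 30-12-2016 | 40 | 33 | 36.5 | 0.01 | T | 0 |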


In [7]:
print("Shape of dataset: ", df.shape)
print("Features: ", df.columns)

Shape of dataset:  (366, 7)
Features:  Index(['date', 'maximum temperature', 'minimum temperature',
       'average temperature', 'precipitation', 'snow fall', 'snow depth'],
      dtype='object')


In [8]:
df.dtypes

date                    object
maximum temperature      int64
minimum temperature      int64
average temperature    float64
precipitation           object
snow fall               object
snow depth              object
dtype: object

In [9]:
df.describe()

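|   | maximum temperature | minimum temperature | average temperature |
|---|---------------------|---------------------|---------------------|
| count | 366.0 | 366.0 | 366.0 |
| mean | 64.625683 | 49.806011 | 57.215847 |
| std | 18.041787 | 16.570747 | 17.12476 |
| min | 15.0 | -1.0 | 7.0 |
| 25% | 50.0 | 37.25 | 44.0 |
| 50% | 64.5 | 48.0 | 55.75 |
| 75% | 81.0 | 65.0 | 73.5 |
| max | 96.0 | 81.0 | 88.5 |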


In [11]:
df.count()

date                   366
maximum temperature    366
minimum temperature    366
average temperature    366
precipitation          366
snow fall              366
snow depth             366
dtype: int64

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 366 non-null    object 
 1   maximum temperature  366 non-null    int64  
 2   minimum temperature  366 non-null    int64  
 3   average temperature  366 non-null    float64
 4   precipitation        366 non-null    object 
 5   snow fall            366 non-null    object 
 6   snow depth           366 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 20.1+ KB


### Filtering Single vs Multiple Columns from a DataFrame

To select a single column, use square brackets [] with the name of the column of interest.

In [13]:
max_temp_df = df['maximum temperature']
max_temp_df

0      42
1      40
2      45
3      36
4      29
       ..
361    60
362    40
363    46
364    40
365    44
Name: maximum temperature, Length: 366, dtype: int64

In [15]:
print('Type: ', type(max_temp_df))
print('Shape: ', max_temp_df.shape)

Type:  <class 'pandas.core.series.Series'>
Shape:  (366,)


In [16]:
# the selection above returned a Series; to get a DataFrame back, pass a list of column names

max_temp_df = df[['maximum temperature']]

max_temp_df

|   | maximum temperature |
|---|---------------------|
| 0 | 42 |
| 1 | 40 |
| 2 | 45 |
| 3 | 36 |
| 4 | 29 |
| ... | ... |
| 361 | 60 |
| 362 | 40 |
| 363 | 46 |
| 364 | 40 |


In [17]:
# selecting multiple columns

max_temp_df = df[['maximum temperature', 'minimum temperature']]

max_temp_df

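|   | maximum temperature | minimum temperature |
|---|---------------------|---------------------|
| 0 | 42 | 34 |
| 1 | 40 | 32 |
| 2 | 45 | 35 |
| 3 | 36 | 14 |
| 4 | 29 | 11 |
| ... | ... | ... |
| 361 | 60 | 40 |
| 362 | 40 | 34 |
| 363 | 46 | 33 |
| 364 | 40 | 33 |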


### Filtering Rows From a DataFrame

There are several ways to do this:
- Selecting rows using slicing operation <br>
df[starting_row_index:ending_row_index:step]

- Boolean indexes. To select rows based on a conditional expression, use a condition inside the [] <br>
df[condition]

In [18]:
# Using way1 (Slicing)

df[1:3]

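|   | date | maximum temperature | minimum temperature | average temperature | precipitation | snow fall | snow depth |
|---|------|---------------------|---------------------|---------------------|---------------|-----------|------------|
| 1 | 2-1-2016 | 40 | 32 | 36.0 | 0.0 | 0.0 | 0 |
| 2 | 3-1-2016 | 45 | 35 | 40.0 | 0.0 | 0.0 | 0 |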


In [19]:
# Using way2 (Boolean indexes)

df['maximum temperature']>90

0      False
1      False
2      False
3      False
4      False
       ...  
361    False
362    False
363    False
364    False
365    False
Name: maximum temperature, Length: 366, dtype: bool
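The comparison on its own only produces a boolean Series. To get the matching rows, that Series is passed back into []. The sketch below uses a four-row made-up frame instead of the weather file, and also shows one way to combine two conditions with & (each condition needs its own parentheses).

```python
import pandas as pd

df = pd.DataFrame({'maximum temperature': [42, 95, 60, 91]})

# Step 1: build the boolean mask
mask = df['maximum temperature'] > 90

# Step 2: use the mask to keep only the rows where it is True
hot_days = df[mask]

# Two conditions combined with & (element-wise "and")
mild_to_hot = df[(df['maximum temperature'] > 50) & (df['maximum temperature'] <= 91)]

print(hot_days.index.tolist())     # [1, 3]
print(mild_to_hot.index.tolist())  # [2, 3]
```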

In [None]:
# Applying