# Python Tutorial : Day 2
Here is what we are going to do today: **Summarizing Data**
1. [df.info()](#1)
2. [df.shape](#2)
3. [df.size](#3)
4. [df.ndim](#4)
5. [df.index](#5)
6. [df.columns](#6)
7. [df.count()](#7)
8. [df.sum()](#8)
9. [df.cumsum()](#9)
10. [df.min()](#10)
11. [df.max()](#11)
12. [df.idxmin()](#12)
13. [df.idxmax()](#13)
14. [df.describe()](#14)
15. [df.mean()](#15)
16. [df.median()](#16)
17. [df.quantile()](#17)
18. [df.var()](#18)
19. [df.std()](#19)
20. [df.cummax()](#20)
21. [df.cummin()](#21)
22. [df['columnName'].cumprod()](#22)
23. [len(df)](#23)
24. [df.isnull()](#24)
25. [df.corr()](#25)

Let's get started!

[Daily news for stock market prediction](https://www.kaggle.com/aaron7sun/stocknews)

When we come across a new dataset, the first thing we do is analyze it. pandas offers many built-in functions and attributes that are very helpful for getting a summary of a dataset. There are lots of them, but I have chosen some of the most important ones. Let's go through each of them one by one.

In [1]:
# import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stocknews/Combined_News_DJIA.csv
/kaggle/input/stocknews/RedditNews.csv
/kaggle/input/stocknews/upload_DJIA_table.csv


In [2]:
# import data
df = pd.read_csv("/kaggle/input/stocknews/upload_DJIA_table.csv")

In [3]:
# looking at the top five rows of the data
df.head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-01 | 17924.240234 | 18002.380859 | 17916.910156 | 17949.369141 | 82160000 | 17949.369141 |
| 1 | 2016-06-30 | 17712.759766 | 17930.609375 | 17711.800781 | 17929.990234 | 133030000 | 17929.990234 |
| 2 | 2016-06-29 | 17456.019531 | 17704.509766 | 17456.019531 | 17694.679688 | 106380000 | 17694.679688 |
| 3 | 2016-06-28 | 17190.509766 | 17409.720703 | 17190.509766 | 17409.720703 | 112190000 | 17409.720703 |
| 4 | 2016-06-27 | 17355.210938 | 17355.210938 | 17063.080078 | 17140.240234 | 138740000 | 17140.240234 |


In [4]:
# looking at the bottom five rows of the data
df.tail()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 1984 | 2008-08-14 | 11532.070312 | 11718.280273 | 11450.889648 | 11615.929688 | 159790000 | 11615.929688 |
| 1985 | 2008-08-13 | 11632.80957 | 11633.780273 | 11453.339844 | 11532.959961 | 182550000 | 11532.959961 |
| 1986 | 2008-08-12 | 11781.700195 | 11782.349609 | 11601.519531 | 11642.469727 | 173590000 | 11642.469727 |
| 1987 | 2008-08-11 | 11729.669922 | 11867.110352 | 11675.530273 | 11782.349609 | 183190000 | 11782.349609 |
| 1988 | 2008-08-08 | 11432.089844 | 11759.959961 | 11388.040039 | 11734.320312 | 212830000 | 11734.320312 |


## Summarize Data
It is easy to get information about the data using pandas. Let's examine the built-in functions one by one:

### 1) **df.info()**<a id="1"></a>
This method prints a concise summary of the data. The output contains:
   * RangeIndex : Specifies the number of entries (rows) and the range of the index
   * Data Columns : Specifies the total number of columns
   * Columns : Gives the name, non-null count and datatype of each column
   * dtypes : Summarizes how many columns have each datatype
   * Memory Usage : Describes the memory usage of the dataframe

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 7 columns):
Date         1989 non-null object
Open         1989 non-null float64
High         1989 non-null float64
Low          1989 non-null float64
Close        1989 non-null float64
Volume       1989 non-null int64
Adj Close    1989 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 108.9+ KB


### 2) **df.shape**<a id="2"></a>
It returns a tuple (ROWS, COLUMNS) giving the shape of the dataframe. For a series, the tuple has a single element, the number of rows.

In [6]:
df.shape

(1989, 7)

### 3) **df.size**<a id="3"></a>
It returns the size of the dataframe/series, which is the total number of elements. For a dataframe, that is rows x columns (here 1989 x 7 = 13923).

In [7]:
df.size

13923

### 4)  **df.ndim**<a id="4"></a>
Returns the number of dimensions of the dataframe/series: 1 for a series (one dimension), 2 for a dataframe (two dimensions).

In [8]:
# dataframe
df.ndim

2

In [9]:
# for series
df['Date'].ndim

1
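
As a quick sanity check on how these three attributes relate, here is a minimal sketch on a tiny made-up frame (`toy` and its values are hypothetical, not taken from the DJIA data). Note that `shape`, `size` and `ndim` are attributes, so they are written without parentheses.

```python
import pandas as pd

# A tiny, made-up DataFrame (3 rows, 2 columns) for illustration only
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

rows, cols = toy.shape        # unpack the (ROWS, COLUMNS) tuple
print(toy.shape)              # (3, 2)
print(toy.size)               # 6, i.e. rows * cols
print(toy.ndim)               # 2

# A single column is a Series: one dimension, one-element shape tuple
print(toy["a"].shape)         # (3,)
print(toy["a"].ndim)          # 1
```

Writing `toy.shape()` would fail, because the tuple returned by the attribute is not callable.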

### **5) df.index** <a id="5"></a>
Returns the index (the row labels) of the dataframe. Here it is a RangeIndex that starts at 0 and stops before 1989.

In [10]:
df.index

RangeIndex(start=0, stop=1989, step=1)
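
To make the idea of "row labels" a bit more concrete, the sketch below builds two small hypothetical frames (`toy` and `labelled` are made-up names): one with the default integer index and one with custom string labels.

```python
import pandas as pd

# Default index: integers 0, 1, 2, ... stored compactly as a RangeIndex
toy = pd.DataFrame({"price": [10.0, 11.5, 9.8]})
print(toy.index)              # RangeIndex(start=0, stop=3, step=1)
print(len(toy.index))         # 3
print(list(toy.index))        # [0, 1, 2]

# Custom labels: the index can hold any hashable values, such as strings
labelled = pd.DataFrame({"price": [10.0, 11.5, 9.8]}, index=["mon", "tue", "wed"])
print(list(labelled.index))   # ['mon', 'tue', 'wed']
```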

### 6) **df.columns**<a id="6"></a>
Returns all the columns in the dataset

In [11]:
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

### 7) **df.count()**<a id="7"></a>
It is used to count the number of non-NA/null observations along the given axis. It works with non-floating-point data (for example, strings) as well.

**Syntax:**

DataFrame.count(axis=0, level=None, numeric_only=False)

**Parameters:**
* **axis :** 0 or ‘index’ to get one count per column, 1 or ‘columns’ to get one count per row
* **level :** (older pandas versions only; this parameter was removed in pandas 2.0) If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame
* **numeric_only :** Include only float, int, boolean data
* **Returns: count :** Series (or DataFrame if level specified)

In [12]:
df.count()

Date         1989
Open         1989
High         1989
Low          1989
Close        1989
Volume       1989
Adj Close    1989
dtype: int64
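
Since the DJIA table has no missing values, every count above is 1989. The effect of `axis` and of missing values is easier to see on a small made-up frame; the following is a minimal sketch (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing float and one missing string
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

per_column = toy.count()          # axis=0 (default): one count per column
per_row = toy.count(axis=1)       # axis=1: one count per row

print(per_column.tolist())        # [2, 2]
print(per_row.tolist())           # [2, 1, 1]
```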

### 8) **df.sum()**<a id="8"></a>
Pandas dataframe.sum() function returns the sum of the values along the requested axis. With the index axis (axis=0, the default), it adds up all the values in each column and returns a series containing one sum per column. By default it skips missing values while calculating the sum. Note that for a string column such as Date, the "sum" is simply all the strings concatenated, as the output below shows.

**Syntax:** 
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

**Parameters :**
* **axis :** {index (0), columns (1)}
* **skipna :** Exclude NA/null values when computing the result.
* **level :** (older pandas versions only; this parameter was removed in pandas 2.0) If the axis is a MultiIndex (hierarchical), sum along a particular level, collapsing into a Series
* **numeric_only :** Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
* **min_count :** The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
* **Returns :** sum : Series or DataFrame (if level specified)

In [13]:
df.sum()

Date         2016-07-012016-06-302016-06-292016-06-282016-0...
Open                                               2.67702e+07
High                                               2.69337e+07
Low                                                2.65988e+07
Close                                               2.6778e+07
Volume                                            323831020000
Adj Close                                           2.6778e+07
dtype: object
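
The parameters `skipna`, `min_count` and `numeric_only` are easier to understand with missing values in the data. Here is a small sketch on a made-up frame (`toy` and its column names are hypothetical); it is only meant to illustrate the parameters listed above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],          # one missing value
    "b": [np.nan, np.nan, np.nan],    # all values missing
    "label": ["x", "y", "z"],         # non-numeric column
})

# numeric_only=True leaves the string column out of the result
totals = toy.sum(numeric_only=True)
print(totals["a"], totals["b"])       # 4.0 0.0

# min_count=1 requires at least one valid value, so "b" becomes NaN
strict = toy.sum(numeric_only=True, min_count=1)
print(pd.isna(strict["b"]))           # True

# skipna=False lets a single NaN propagate into the result
print(pd.isna(toy["a"].sum(skipna=False)))   # True

# axis=1 adds across the columns, giving one sum per row
row_totals = toy[["a", "b"]].sum(axis=1)
print(row_totals.tolist())            # [1.0, 0.0, 3.0]
```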

### 9) **df.cumsum()**<a id="9"></a>
Returns a DataFrame or Series of the same size containing the cumulative sum.

A **cumulative sum** is a sequence of partial sums of a given sequence. For example, the cumulative sums of the sequence {a,b,c,...}, are a, a+b, a+b+c, ...

In [14]:
df.cumsum().head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 82160000 | 17949.4 |
| 1 | 2016-07-012016-06-30 | 35637.0 | 35933.0 | 35628.7 | 35879.4 | 215190000 | 35879.4 |
| 2 | 2016-07-012016-06-302016-06-29 | 53093.0 | 53637.5 | 53084.7 | 53574.0 | 321570000 | 53574.0 |
| 3 | 2016-07-012016-06-302016-06-292016-06-28 | 70283.5 | 71047.2 | 70275.2 | 70983.8 | 433760000 | 70983.8 |
| 4 | 2016-07-012016-06-302016-06-292016-06-282016-0... | 87638.7 | 88402.4 | 87338.3 | 88124.0 | 572500000 | 88124.0 |
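
In the output above the Date strings get concatenated, which is rarely what we want. A minimal sketch of the a, a+b, a+b+c idea on made-up numbers, plus one possible way to restrict the running sum to numeric columns (`volume` and `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical daily volumes, for illustration only
volume = pd.Series([10, 20, 30, 40])

running_total = volume.cumsum()
print(running_total.tolist())                   # [10, 30, 60, 100]

# The last running total equals the plain sum
print(running_total.iloc[-1] == volume.sum())   # True

# On a mixed frame, keep only the numeric columns before accumulating
toy = pd.DataFrame({"Date": ["d1", "d2"], "Volume": [5, 7]})
numeric_running = toy.select_dtypes(include="number").cumsum()
print(numeric_running["Volume"].tolist())       # [5, 12]
```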


### 10) **df.min()**<a id="10"></a>
Returns the minimum of the values in the given object.

In [15]:
df.min()

Date         2008-08-08
Open            6547.01
High            6709.61
Low             6469.95
Close           6547.05
Volume          8410000
Adj Close       6547.05
dtype: object

### 11) **df.max()**<a id="11"></a>
Returns the maximum of the values in the given object.

In [16]:
df.max()

Date         2016-07-01
Open            18315.1
High            18351.4
Low             18272.6
Close           18312.4
Volume        674920000
Adj Close       18312.4
dtype: object

### 12) **df.idxmin()**<a id="12"></a>
idxmin() function returns the index of the first occurrence of the minimum over the requested axis. While finding the index of the minimum value, all NA/null values are excluded.

**Syntax:** DataFrame.idxmin(axis=0, skipna=True)

**Parameters :**
* **axis :** 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
* **skipna :** Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns : idxmin : Series

In [17]:
df['Open'].idxmin()

1842

### 13) **df.idxmax()**<a id="13"></a>
idxmax() function returns the index of the first occurrence of the maximum over the requested axis.

In [18]:
df['Open'].idxmax()

282
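
The value returned by idxmin()/idxmax() is an index label, so it can be passed to `.loc` to look up the whole row. A small sketch on made-up rows (the three rows below only mimic the shape of the DJIA table):

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Open": [17924.24, 17712.76, 17456.02],
})

low_label = toy["Open"].idxmin()     # label of the row with the smallest Open
high_label = toy["Open"].idxmax()    # label of the row with the largest Open
print(low_label, high_label)         # 2 0

# Use the label to fetch the full row
lowest_row = toy.loc[low_label]
print(lowest_row["Date"])            # 2016-06-29
```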

### 14) **df.describe()**<a id="14"></a>
This method provides basic statistical information about the data. By default, only the numeric columns are summarized.

* **count:** number of entries
* **mean:** average of entries
* **std:** standard deviation
* **min:** minimum entry
* **25%:** first quartile
* **50%:** median or second quartile
* **75%:** third quartile
* **max:** maximum entry

In [19]:
df.describe()

| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| count | 1989.0 | 1989.0 | 1989.0 | 1989.0 | 1989.0 | 1989.0 |
| mean | 13459.116048 | 13541.303173 | 13372.931728 | 13463.032255 | 162811000.0 | 13463.032255 |
| std | 3143.281634 | 3136.271725 | 3150.420934 | 3144.006996 | 93923430.0 | 3144.006996 |
| min | 6547.009766 | 6709.609863 | 6469.950195 | 6547.049805 | 8410000.0 | 6547.049805 |
| 25% | 10907.339844 | 11000.980469 | 10824.759766 | 10913.379883 | 100000000.0 | 10913.379883 |
| 50% | 13022.049805 | 13088.110352 | 12953.129883 | 13025.580078 | 135170000.0 | 13025.580078 |
| 75% | 16477.699219 | 16550.070312 | 16392.769531 | 16478.410156 | 192600000.0 | 16478.410156 |
| max | 18315.060547 | 18351.359375 | 18272.560547 | 18312.390625 | 674920000.0 | 18312.390625 |
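
The Date column is missing from the output above because it holds strings. As a rough sketch (on a made-up frame, `toy` is hypothetical), `include="all"` asks describe() to summarize non-numeric columns too, and `percentiles` changes which percentiles are reported.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "a", "a"],
})

# Default: only the numeric column is summarized
numeric_summary = toy.describe()
print(list(numeric_summary.columns))         # ['price']
print(numeric_summary.loc["mean", "price"])  # 2.5

# include="all": string columns get count / unique / top / freq
full_summary = toy.describe(include="all")
print(full_summary.loc["unique", "label"])   # 2
print(full_summary.loc["top", "label"])      # a
print(full_summary.loc["freq", "label"])     # 3

# Ask for the 10th and 90th percentiles
custom = toy["price"].describe(percentiles=[0.1, 0.9])
print("10%" in custom.index, "90%" in custom.index)   # True True
```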


### **15) df.mean()** <a id="15"></a>
This method returns the mean value of each numeric column.

In [20]:
df.mean()

Open         1.345912e+04
High         1.354130e+04
Low          1.337293e+04
Close        1.346303e+04
Volume       1.628110e+08
Adj Close    1.346303e+04
dtype: float64

### **16) df.median()** <a id="16"></a>
This method returns the median of each column with numeric values.

In [21]:
df.median()

Open         1.302205e+04
High         1.308811e+04
Low          1.295313e+04
Close        1.302558e+04
Volume       1.351700e+08
Adj Close    1.302558e+04
dtype: float64
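
The outputs above come from a pandas version that silently skipped the string column Date. Newer pandas releases (2.0 and later) may instead raise an error when a frame contains non-numeric columns, so a safer habit is to pass `numeric_only=True`. A minimal sketch on made-up values (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Open": [1.0, 2.0, 6.0],
    "Volume": [10, 20, 30],
})

# Restrict the calculation to numeric columns explicitly
means = toy.mean(numeric_only=True)
medians = toy.median(numeric_only=True)

print(means["Open"], means["Volume"])       # 3.0 20.0
print(medians["Open"], medians["Volume"])   # 2.0 20.0
print("Date" in means.index)                # False
```

With the skewed Open values (1, 2, 6), the mean (3.0) is pulled up by the large value while the median (2.0) is not.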

### **17) df.quantile()**<a id="17"></a>
df.quantile() function returns the values at the given quantile(s) over the requested axis.

**Note:** Quantiles are the values of a variate that divide a frequency distribution into equal groups, each containing the same fraction of the total population.

In [22]:
df.quantile([0.25,0.75])

| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 0.25 | 10907.339844 | 11000.980469 | 10824.759766 | 10913.379883 | 100000000.0 | 10913.379883 |
| 0.75 | 16477.699219 | 16550.070312 | 16392.769531 | 16478.410156 | 192600000.0 | 16478.410156 |
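
To see what a quantile means on numbers small enough to check by hand, here is a short sketch with made-up series (`s` and `even` are hypothetical). By default pandas interpolates linearly between neighbouring values when the quantile falls between two data points.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q25 = s.quantile(0.25)    # first quartile
q50 = s.quantile(0.50)    # median
q75 = s.quantile(0.75)    # third quartile
print(q25, q50, q75)      # 2.0 3.0 4.0

# Interquartile range: spread of the middle half of the data
iqr = q75 - q25
print(iqr)                # 2.0

# With an even number of values the median lies between two points
even = pd.Series([1.0, 2.0, 3.0, 4.0])
print(even.quantile(0.5)) # 2.5
```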


### **18) df.var()** <a id="18"></a>
Returns the variance for each column with a numeric value.

In [23]:
df.var()

Open         9.880219e+06
High         9.836200e+06
Low          9.925152e+06
Close        9.884780e+06
Volume       8.821610e+15
Adj Close    9.884780e+06
dtype: float64

### **19) df.std()**<a id="19"></a>
Returns the standard deviation for each column with numeric values. The standard deviation is the square root of the variance.

In [24]:
df.std()

Open         3.143282e+03
High         3.136272e+03
Low          3.150421e+03
Close        3.144007e+03
Volume       9.392343e+07
Adj Close    3.144007e+03
dtype: float64
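
Variance and standard deviation are closely linked: the second is the square root of the first. By default pandas computes the sample versions (dividing by n - 1, i.e. `ddof=1`). The sketch below checks this on a small made-up series (`s` is hypothetical).

```python
import math
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mean is 5.0, squared deviations add up to 32.0

sample_var = s.var()          # ddof=1: 32 / 7
pop_var = s.var(ddof=0)       # ddof=0: 32 / 8
pop_std = s.std(ddof=0)       # square root of the population variance

print(math.isclose(sample_var, 32 / 7))    # True
print(math.isclose(pop_var, 4.0))          # True
print(math.isclose(pop_std, 2.0))          # True

# std squared gives back var
print(math.isclose(s.std() ** 2, s.var())) # True
```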

### **20) df.cummax()**<a id="20"></a> 
Calculates the cumulative maximum: each entry is the largest value seen so far in its column, from the first row down to the current one.

In [25]:
df.cummax().head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 82160000 | 17949.4 |
| 1 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 133030000 | 17949.4 |
| 2 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 133030000 | 17949.4 |
| 3 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 133030000 | 17949.4 |
| 4 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 138740000 | 17949.4 |


### **21) df.cummin()** <a id="21"></a>
Calculates the cumulative minimum: each entry is the smallest value seen so far in its column, from the first row down to the current one.

In [26]:
df.cummin().head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-01 | 17924.2 | 18002.4 | 17916.9 | 17949.4 | 82160000 | 17949.4 |
| 1 | 2016-06-30 | 17712.8 | 17930.6 | 17711.8 | 17930.0 | 82160000 | 17930.0 |
| 2 | 2016-06-29 | 17456.0 | 17704.5 | 17456.0 | 17694.7 | 82160000 | 17694.7 |
| 3 | 2016-06-28 | 17190.5 | 17409.7 | 17190.5 | 17409.7 | 82160000 | 17409.7 |
| 4 | 2016-06-27 | 17190.5 | 17355.2 | 17063.1 | 17140.2 | 82160000 | 17140.2 |
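
Running maxima and minima are easiest to follow on a short sequence. A minimal sketch with made-up prices (`price` is hypothetical):

```python
import pandas as pd

price = pd.Series([5, 3, 8, 6, 9, 2])

running_max = price.cummax()   # largest value seen so far
running_min = price.cummin()   # smallest value seen so far

print(running_max.tolist())    # [5, 5, 8, 8, 9, 9]
print(running_min.tolist())    # [5, 3, 3, 3, 3, 2]

# The last entries match the overall max and min
print(running_max.iloc[-1] == price.max())   # True
print(running_min.iloc[-1] == price.min())   # True
```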


### **22) df['columnname'].cumprod()**<a id="22"></a>
Returns the cumulative product of the data, i.e. the running product of all values up to and including each row.

In [27]:
df['Open'].cumprod().head()

0    1.792424e+04
1    3.174878e+08
2    5.542073e+12
3    9.527105e+16
4    1.653449e+21
Name: Open, dtype: float64
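
Multiplying raw prices together, as above, quickly produces huge numbers. A cumulative product is more commonly applied to growth factors. The sketch below shows the mechanics on made-up values (`numbers` and `growth` are hypothetical).

```python
import math
import pandas as pd

# Running product of 1, 2, 3, 4
numbers = pd.Series([1, 2, 3, 4])
print(numbers.cumprod().tolist())     # [1, 2, 6, 24]

# Made-up growth factors: +10%, -50%, +100%
growth = pd.Series([1.10, 0.50, 2.00])
cumulative = growth.cumprod()

# After all three periods the value is back at about 1.10 times the start
print(math.isclose(cumulative.iloc[-1], 1.10))   # True
```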

### **23) len(df)**<a id="23"></a>
Returns the number of entries (rows) in the dataset.

In [28]:
len(df)

1989

### **24) df.isnull()** <a id="24"></a>
Checks for null values. It returns a dataframe of the same shape filled with booleans: True where a value is missing, False otherwise.

In [29]:
df.isnull().head()

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
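
A grid of booleans is hard to read on a large dataset, so isnull() is often combined with sum() to count the missing values per column. A small sketch on a made-up frame that actually contains missing values (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

mask = toy.isnull()                  # same shape as toy, True where missing

# True counts as 1, so summing gives the number of missing values per column
missing_per_column = mask.sum()
print(missing_per_column.tolist())   # [1, 2]

# Is anything missing anywhere in the frame?
print(bool(mask.any().any()))        # True

# Keep only the rows without any missing value
complete_rows = toy.dropna()
print(len(complete_rows))            # 1
```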


### **25) df.corr()** <a id="25"></a>
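It computes the pairwise correlation between the numeric columns (Pearson correlation by default). Values close to 1 or -1 indicate a strong positive or negative linear relationship, while values near 0 indicate little linear relationship.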

In [30]:
df.corr()

| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| Open | 1.0 | 0.999592 | 0.999436 | 0.998991 | -0.691621 | 0.998991 |
| High | 0.999592 | 1.0 | 0.999373 | 0.999546 | -0.686997 | 0.999546 |
| Low | 0.999436 | 0.999373 | 1.0 | 0.999595 | -0.699572 | 0.999595 |
| Close | 0.998991 | 0.999546 | 0.999595 | 1.0 | -0.694281 | 1.0 |
| Volume | -0.691621 | -0.686997 | -0.699572 | -0.694281 | 1.0 | -0.694281 |
| Adj Close | 0.998991 | 0.999546 | 0.999595 | 1.0 | -0.694281 | 1.0 |
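
To get a feel for what the extreme values in a correlation matrix mean, here is a sketch on a made-up frame where the relationships are known in advance (`toy` and its column names are hypothetical). The string column is dropped with select_dtypes() first, since newer pandas versions may raise an error on non-numeric columns.

```python
import math
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],       # rises exactly with x
    "neg_x": [-1, -2, -3, -4],      # falls exactly as x rises
    "label": ["a", "b", "c", "d"],  # non-numeric, excluded below
})

corr = toy.select_dtypes(include="number").corr()

print(math.isclose(corr.loc["x", "double_x"], 1.0))    # True
print(math.isclose(corr.loc["x", "neg_x"], -1.0))      # True
print("label" in corr.columns)                         # False
```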


FOLLOW FOR MORE TUTORIALS!!!