Today we will learn to create a DataFrame from dummy data instead of reading it from a file. We will build some toy data to play around with, starting by importing pandas as pd:

In [1]:
import pandas as pd

###### Create a DataFrame from a dictionary:

In [2]:
pd.DataFrame({'id':[100,101,102], 'color':['red','blue','red']}) # 'DataFrame' is case sensitive: both D and F must be capitals.
# pd.DataFrame is the DataFrame constructor; here we pass it a dictionary.

Unnamed: 0,id,color
0,100,red
1,101,blue
2,102,red


Python dictionaries hold key-value pairs. In this case the keys 'id' and 'color' became the column names, and the values became the values in those columns (each column is a Series).

If we don't specify the index, the dataframe constructor will just use the default index which is integers starting at 0.
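A minimal sketch to confirm both points, rebuilding the same toy data (the variable name `df` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'red']})

# Dictionary keys become the column labels, in insertion order.
print(list(df.columns))   # ['id', 'color']

# Each column can be pulled out as a pandas Series.
print(isinstance(df['color'], pd.Series))   # True

# With no index given, pandas uses default integer labels starting at 0.
print(list(df.index))     # [0, 1, 2]
```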

But let's say you have your own index labels and want to use them as the index. This is how we can do that:

In [3]:
pd.DataFrame({'id':[100,101,102], 'color':['red','blue','red']}, index=['a','b','c']) 
# can pass list of integers or strings to index

Unnamed: 0,id,color
a,100,red
b,101,blue
c,102,red


We can see that the rows are now labelled with the index values 'a', 'b' and 'c'.

This is the most common way to create a DataFrame for playing around. We will save it as DF and use it again later.

In [4]:
DF= pd.DataFrame({'id':[100,101,102], 'color':['red','blue','red']}, index=['a','b','c']) 

In [5]:
DF

Unnamed: 0,id,color
a,100,red
b,101,blue
c,102,red


What if our data is in a different shape? Let's check what we mean:

Let's call the pandas DataFrame constructor and pass it a list of 3 lists as below:

In [6]:
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']]) #passing list of lists

Unnamed: 0,0,1
0,100,red
1,101,blue
2,102,red


Here, when we pass a list of lists, each inner list is treated as a row, and the rows get stacked on top of one another.

So we can see the difference: when we pass a dictionary, each key becomes a column name and each value becomes that column's series of values. When we do the same thing by passing a list of lists, we write out what's in each row and the rows get stacked on top of one another, as in the result above. In this output we can also notice that pandas used not only the default index but also default column names, 0 and 1. Just as we did above for the index, we can specify the column names with columns as below:

In [7]:
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id','color'])

Unnamed: 0,id,color
0,100,red
1,101,blue
2,102,red


So here is the same dataframe as we created before but just constructed in a different way. 
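As a quick check that the two construction routes really agree, here is a small sketch using `DataFrame.equals`, which compares shape, values and dtypes. The names `from_dict` and `from_rows` are made up for this example:

```python
import pandas as pd

# Column-wise: each dictionary key is a column.
from_dict = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'red']})

# Row-wise: each inner list is a row, with column names supplied separately.
from_rows = pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']],
                         columns=['id', 'color'])

same = from_dict.equals(from_rows)
print(same)
```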

Sometimes we don't have dictionaries or lists of lists; instead we have a NumPy array that we want to convert to a DataFrame. Let's see how to do that, starting by importing NumPy:

In [8]:
import numpy as np

Here let's take advantage of NumPy's random number functionality to create an array as below.

NumPy has many more random number functions; see the documentation:
https://numpy.org/doc/stable/reference/random/index.html

In [9]:
arr = np.random.rand(4,2)

In [10]:
arr

array([[0.81324869, 0.31788305],
       [0.63187631, 0.06534513],
       [0.37188832, 0.18359047],
       [0.27095991, 0.03437462]])

Here, it has created a 4-row by 2-column NumPy array of random numbers drawn from a uniform distribution over [0, 1), that is, from 0 up to but not including 1.

Let's convert the above array into a DataFrame by passing the array to the DataFrame constructor:

In [11]:
pd.DataFrame(arr)

Unnamed: 0,0,1
0,0.813249,0.317883
1,0.631876,0.065345
2,0.371888,0.18359
3,0.27096,0.034375


By default, the columns get the integer labels 0 and 1 rather than names. We can add names manually like before:

In [12]:
pd.DataFrame(arr, columns=['one','two'])

Unnamed: 0,one,two
0,0.813249,0.317883
1,0.631876,0.065345
2,0.371888,0.18359
3,0.27096,0.034375


So this is how we can convert a numpy array to a dataframe. 
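Note that np.random.rand draws fresh numbers on every run, so the values shown above will differ each time. If repeatable toy data is wanted, one option is NumPy's seeded generator API. This is a sketch rather than what the notebook above used, and the seed value 42 is arbitrary:

```python
import numpy as np
import pandas as pd

# A seeded generator gives the same numbers on every run.
rng = np.random.default_rng(seed=42)

# 4 rows by 2 columns, uniform on [0, 1), like np.random.rand(4, 2).
arr = rng.random((4, 2))

df = pd.DataFrame(arr, columns=['one', 'two'])
print(df.shape)   # (4, 2)
```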

###### Next, let's see how to create a toy dataset / large DataFrame without typing a huge amount of text:

Let's do that by taking advantage of some more NumPy functionality. There are many ways to implement this, but our target is to create a DataFrame of 10 rows and 2 columns.

In [13]:
pd.DataFrame({'student':np.arange(100, 110, 1),'test':np.random.randint(60, 101, 10)})
# Student IDs run from 100 up to (but not including) 110 in steps of 1. The second column, 'test', holds test scores:
# 10 random integers from 60 (inclusive) to 101 (exclusive), i.e. 60 through 100.

Unnamed: 0,student,test
0,100,78
1,101,76
2,102,70
3,103,69
4,104,77
5,105,85
6,106,71
7,107,83
8,108,89
9,109,86


Here we pass a dictionary whose first key is 'student' (the student IDs). Instead of passing a list, we pass the IDs as np.arange(). arange stands for "array range"; it is similar to Python's range function, except that arange outputs a NumPy array. The second key, 'test', gets the test scores from np.random.randint().

np.arange(): Return evenly spaced values within a given interval.
https://numpy.org/doc/stable/reference/generated/numpy.arange.html

numpy.random.randint(): Return random integers from low (inclusive) to high (exclusive).
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html

As the above result shows, arange is inclusive of the first item and exclusive of the last. Here is our DataFrame, with 'student' and 'test' as the column names and the default index.

The student column runs from 100 through 109 in steps of 1, and the test scores are random integers between 60 and 100 (both inclusive).

Other random integer functions might also be useful while creating these DataFrames.
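The inclusive and exclusive bounds can be checked directly. The sketch below also shows one alternative, the generator method `integers` with `endpoint=True`, which lets the upper bound be written as 100 instead of 101. The names `ids`, `scores` and `scores_alt` are only for illustration:

```python
import numpy as np

ids = np.arange(100, 110, 1)              # start is included, stop is excluded
scores = np.random.randint(60, 101, 10)   # low is included, high is excluded

print(ids[0], ids[-1], len(ids))          # 100 109 10
print(scores.min() >= 60 and scores.max() <= 100)   # True

# Alternative: the newer Generator API can include the upper bound.
rng = np.random.default_rng()
scores_alt = rng.integers(60, 100, size=10, endpoint=True)   # 60 through 100
```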

One more thing about DataFrame construction: whenever we create a DataFrame, we can chain it together with .set_index() if one of the columns should become the index.

In [14]:
pd.DataFrame({'student':np.arange(100, 110, 1),'test':np.random.randint(60, 101, 10)}).set_index('student')

student,test
100,68
101,72
102,72
103,69
104,70
105,69
106,68
107,77
108,69
109,76


The above output shows that student is now our index and test is the only remaining column.
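A short sketch to confirm what set_index did, and to show that rows can then be looked up by student ID with .loc. The scores are random, so the looked-up value will differ between runs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'student': np.arange(100, 110, 1),
                   'test': np.random.randint(60, 101, 10)}).set_index('student')

print(df.index.name)      # student
print(list(df.columns))   # ['test']

# Label-based lookup: the test score for student 105.
score_105 = df.loc[105, 'test']
print(score_105)
```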

###### Useful tip: Create a new Series and attach it to an existing DataFrame:

Let's use the Series constructor:

In [15]:
s = pd.Series(['round', 'square'], index=['c','b'], name = 'shape')

In [16]:
s

c     round
b    square
Name: shape, dtype: object

Above is the Series we have created. The first thing passed to the Series constructor becomes the values of the Series. The second thing passed, index, becomes the index of the Series (the identifier of the rows). The third thing, name, identifies the Series itself.

Let's see how the DataFrame looks before we attach the new Series to it:

In [17]:
DF

Unnamed: 0,id,color
a,100,red
b,101,blue
c,102,red


This is our DataFrame: it has the columns id and color and the index a, b, c. One way to combine a Series with a DataFrame is to concatenate them using pandas concat(), a top-level function that takes a list of objects to concatenate:

In [18]:
pd.concat([DF, s], axis=1)
# concatenate side by side, so axis=1; to stack rows instead, use axis=0

Unnamed: 0,id,color,shape
a,100,red,
b,101,blue,square
c,102,red,round


3 things to notice:
-> The name of the Series became the column name when it was added to DF.
-> The Series is aligned to DF by index, not by position: b is square and c is round.
-> Index a has no entry in the Series, so its shape is NaN (a missing value), shown as a blank above.
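These three points can be verified with a small self-contained sketch that rebuilds DF and s. The name `combined` is introduced here only for illustration:

```python
import pandas as pd

DF = pd.DataFrame({'id': [100, 101, 102], 'color': ['red', 'blue', 'red']},
                  index=['a', 'b', 'c'])
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')

combined = pd.concat([DF, s], axis=1)

# 1. The Series name became the new column name.
print(list(combined.columns))             # ['id', 'color', 'shape']

# 2. Values were matched by index label, not by position.
print(combined.loc['b', 'shape'])         # square
print(combined.loc['c', 'shape'])         # round

# 3. Label 'a' is missing from the Series, so its shape is a missing value.
print(pd.isna(combined.loc['a', 'shape']))   # True
```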