# Altair - Vega Graph 

Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite.
Altair offers a powerful and concise visualization grammar that enables you to build a wide range of statistical visualizations quickly.

https://altair-viz.github.io/getting_started/overview.html

### Specifying Data in Altair

Each top-level chart object (i.e. Chart, LayerChart, VConcatChart, HConcatChart, RepeatChart, and FacetChart) accepts a dataset as its first argument. The dataset can be specified in one of the following ways:

- as a Pandas DataFrame
- as a Data or related object (i.e. UrlData, InlineData, NamedData)
- as a URL string pointing to a JSON- or CSV-formatted text file
- as an object that supports the `__geo_interface__` (e.g. a GeoPandas GeoDataFrame, Shapely geometries, GeoJSON objects)
  
https://altair-viz.github.io/user_guide/data.html

**The goal of this exercise is to generate as many graphs as possible from the given dataset using Altair. For now, we focus on generating graphs from all the available columns (except those with string data types). We automate the generation of the JSON specifications, which are then used to generate images in Visual Studio Code.**

In [1]:
import pandas as pd
import numpy as np
import json
import os

# Import the Altair API
import altair as alt

In [2]:
df = pd.read_csv('Titanic_full.csv')

In [3]:
#df = pd.read_csv('AB_NYC_2019.csv')

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 122.8+ KB


In [6]:
df.shape

(1309, 12)

In [7]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
# Count the null values in each column
nulls_count = {col: df[col].isnull().sum() for col in df.columns} 
print(nulls_count)

{'PassengerId': 0, 'Survived': 0, 'Pclass': 0, 'Name': 0, 'Sex': 0, 'Age': 263, 'SibSp': 0, 'Parch': 0, 'Ticket': 0, 'Fare': 1, 'Cabin': 1014, 'Embarked': 2}


## Data Cleaning

In [9]:
# Drop columns in which more than 30% of the values are null.
# In the remaining numeric (int and float) columns, replace nulls with the column mean.
# In the remaining non-numeric columns, drop the rows that still contain nulls.

is_null_count_out_of_range = {col: df[col].isnull().sum() / df.shape[0] * 100 > 30 for col in df.columns}

for k, v in is_null_count_out_of_range.items():
    if v:
        df.drop(k, axis=1, inplace=True)
    elif pd.api.types.is_numeric_dtype(df[k]):
        # Check the column dtype rather than the type of its first value
        df[k] = df[k].fillna(df[k].mean())
    else:
        df.dropna(subset=[k], inplace=True)

# Recount the nulls once, after all columns have been processed
nulls_count = {col: df[col].isnull().sum() for col in df.columns}

print(nulls_count)

{'PassengerId': 0, 'Survived': 0, 'Pclass': 0, 'Name': 0, 'Sex': 0, 'Age': 0, 'SibSp': 0, 'Parch': 0, 'Ticket': 0, 'Fare': 0, 'Embarked': 0}
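
To make the effect of these three rules easier to see, here is a small sketch that applies them to a toy frame. The function name `clean_nulls`, the `max_null_pct` parameter, and the toy columns are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def clean_nulls(frame, max_null_pct=30):
    """Apply the three cleaning rules to a copy of `frame` (hypothetical helper)."""
    out = frame.copy()
    # Rule 1: drop columns whose share of nulls exceeds the threshold
    null_pct = out.isnull().mean() * 100
    out = out.drop(columns=null_pct[null_pct > max_null_pct].index)
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Rule 2: fill numeric nulls with the column mean
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Rule 3: drop rows that are null in a non-numeric column
            out = out.dropna(subset=[col])
    return out

toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],    # 25% null: kept, filled with the mean
    "cabin": [None, None, "C85", None],   # 75% null: dropped
    "port": ["S", "C", None, "S"],        # 25% null: the affected row is dropped
})
cleaned = clean_nulls(toy)
print(cleaned)
```

The order matters: numeric columns are filled before later columns cause rows to be dropped, so the mean is computed over the rows present at that point in the loop.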


## Data type conversion

In [10]:
# Find all the unique values in each column
uniques = {col: df[col].unique().tolist() for col in df.columns}

print(uniques)

{'PassengerId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 

In [11]:
# Directory 
directory = "Altair_Plots"
  
# Parent Directory path 
parent_dir = "../"
  
# Path 
path = os.path.join(parent_dir, directory) 

try:  
    os.mkdir(path)  
except OSError as error:  
    print(error)

[WinError 183] Cannot create a file when that file already exists: '../Altair_Plots'
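
The `[WinError 183]` message above is printed because `os.mkdir` raises an error when the target directory already exists. One possible alternative, sketched here with a temporary directory standing in for `../`, is `os.makedirs` with `exist_ok=True`, which makes the cell safe to re-run.

```python
import os
import tempfile

# A temporary directory stands in for the notebook's parent directory
base = tempfile.mkdtemp()
path_demo = os.path.join(base, "Altair_Plots")

os.makedirs(path_demo, exist_ok=True)  # creates the directory
os.makedirs(path_demo, exist_ok=True)  # second call is a no-op, no error raised

print(os.path.isdir(path_demo))
```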


In [12]:
# Write the dictionary created above to a text file

with open(path + '/Unique_values.txt', 'w') as json_file:
    json.dump(uniques, json_file)
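
`json.dump` succeeds here because `.tolist()` converts NumPy scalars into native Python values. The short sketch below, using a hypothetical Series, illustrates the difference between `list(...)` and `.tolist()` on the result of `unique()`.

```python
import json
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# list() keeps NumPy scalars (numpy.int64), which json cannot serialize
try:
    json.dumps({"a": list(s.unique())})
    numpy_scalars_ok = True
except TypeError:
    numpy_scalars_ok = False

# .tolist() yields plain Python ints, which serialize cleanly
encoded = json.dumps({"a": s.unique().tolist()})
print(numpy_scalars_ok, encoded)
```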

In [13]:
# Check for any remaining null values
df.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Embarked       False
dtype: bool

In [14]:
df.shape

(1307, 11)

In [15]:
# Count the unique values in each column
for k, v in uniques.items():
    print(k + ' : ' + str(len(v)))

PassengerId : 1307
Survived : 2
Pclass : 3
Name : 1305
Sex : 2
Age : 99
SibSp : 7
Parch : 8
Ticket : 928
Fare : 281
Embarked : 3


In [16]:
# Convert columns with 10 or fewer unique values to the category dtype

for k, v in uniques.items():
    if len(v) <= 10:
        df[k] = df[k].astype('category')

df.dtypes

PassengerId       int64
Survived       category
Pclass         category
Name             object
Sex            category
Age             float64
SibSp          category
Parch          category
Ticket           object
Fare            float64
Embarked       category
dtype: object
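
The same conversion can also be sketched without the intermediate `uniques` dictionary by using `DataFrame.nunique()`. The toy frame and the `max_levels` threshold below are illustrative assumptions; the notebook itself uses a threshold of 10.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Fare": [71.28, 13.0, 7.25, 53.1, 26.0, 8.05],
})

max_levels = 3  # threshold chosen for this toy example only

# Count distinct values per column and pick the low-cardinality ones
level_counts = toy.nunique()
low_card = level_counts[level_counts <= max_levels].index

# Convert only those columns to the category dtype
toy = toy.astype({col: "category" for col in low_card})
print(toy.dtypes)
```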

In [17]:
cat_col = df.select_dtypes(include=['category']).columns.tolist()
num_col = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("Numerical columns:\n" + str(num_col))
print("Categorical columns:\n" + str(cat_col))

Numerical columns:
['PassengerId', 'Age', 'Fare']
Categorical columns:
['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']


## Altair JSON generation

In [18]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [21]:
alt.Chart(df).mark_bar().encode(x='Pclass', y='Fare')

In [22]:
# Generate a JSON specification for every categorical/numerical column pair

for cat in cat_col:
    for num in num_col:
        chart = alt.Chart(df).mark_bar().encode(x=cat, y=num)
        chart.save(path + '/' + cat + " Vs " + num + "_plot.json")
print("JSON generated in Altair_Plots folder for the combinations")

JSON generated in Altair_Plots folder for the combinations


<div class="alert alert-block alert-info">
    <b>Copyright</b> 2020 Srushti Dhamangaonkar and Hung-Chih Huang<br>
    <br>Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:<br>
    <br>The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.<br>
    <br>THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
    <br><br>
    
<div class="text-center">
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/3.0/us/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.<br>
</div></div>